linux/kernel
Suresh Siddha 9000f05c6d sched: Fix SMT scheduler regression in find_busiest_queue()
Fix a SMT scheduler performance regression that is leading to a scenario
where SMT threads in one core are completely idle while both the SMT threads
in another core (on the same socket) are busy.

This is caused by this commit (with the problematic code highlighted)

   commit bdb94aa5db
   Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
   Date:   Tue Sep 1 10:34:38 2009 +0200

   sched: Try to deal with low capacity

   @@ -4203,15 +4223,18 @@ find_busiest_queue()
   ...
	for_each_cpu(i, sched_group_cpus(group)) {
   +	unsigned long power = power_of(i);

   ...

   -	wl = weighted_cpuload(i);
   +	wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
   +	wl /= power;

   -	if (rq->nr_running == 1 && wl > imbalance)
   +	if (capacity && rq->nr_running == 1 && wl > imbalance)
		continue;

On a SMT system, power of the HT logical cpu will be 589 and
the scheduler load imbalance (for scenarios like the one mentioned above)
can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
the weighted load with the power will result in "wl > imbalance" and
ultimately resulting in find_busiest_queue() return NULL, causing
load_balance() to think that the load is well balanced. But infact
one of the tasks can be moved to the idle core for optimal performance.

We don't need to use the weighted load (wl) scaled by the cpu power to
compare with  imabalance. In that condition, we already know there is only a
single task "rq->nr_running == 1" and the comparison between imbalance,
wl is to make sure that we select the correct priority thread which matches
imbalance. So we really need to compare the imabalnce with the original
weighted load of the cpu and not the scaled load.

But in other conditions where we want the most hammered(busiest) cpu, we can
use scaled load to ensure that we consider the cpu power in addition to the
actual load on that cpu, so that we can move the load away from the
guy that is getting most hammered with respect to the actual capacity,
as compared with the rest of the cpu's in that busiest group.

Fix it.

Reported-by: Ma Ling <ling.ma@intel.com>
Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
Cc: stable@kernel.org [2.6.32.x]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-02-16 15:13:59 +01:00
..
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq genirq: Convert irq_desc.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
power vt: introduce and use vt_kmsg_redirect() function 2009-12-15 08:53:28 -08:00
time clocksource: Prevent potential kgdb dead lock 2010-01-26 14:53:16 +01:00
trace tracing/documentation: Cover new frame pointer semantics 2010-01-26 17:00:39 -05:00
.gitignore
acct.c bsdacct: fix uid/gid misreporting 2009-12-15 08:53:10 -08:00
async.c async: Fix lack of boot-time console due to insufficient synchronization 2009-06-08 12:31:53 -07:00
audit_tree.c fix more leaks in audit_tree.c tag_chunk() 2009-12-19 09:27:43 -08:00
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h Fix rule eviction order for AUDIT_DIR 2009-06-24 00:02:38 -04:00
auditfilter.c Audit: clean up all op= output to include string quoting 2009-06-24 00:00:52 -04:00
auditsc.c Sanitize f_flags helpers 2009-12-22 12:27:34 -05:00
backtracetest.c
bounds.c kbuild: move bounds.h to include/generated 2009-12-12 13:08:14 +01:00
capability.c remove CONFIG_SECURITY_FILE_CAPABILITIES compile option 2009-11-24 15:06:47 +11:00
cgroup_freezer.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
cgroup.c cgroups: fix to return errno in a failure path 2010-02-02 18:11:22 -08:00
compat.c
configs.c
cpu.c sched: Correct printk whitespace in warning from cpu down task check 2010-01-28 06:59:55 +01:00
cpuset.c sched: Fix balance vs hotplug race 2009-12-06 21:10:56 +01:00
cred-internals.h
cred.c kernel/cred.c: use kmem_cache_free 2010-02-03 10:21:57 +11:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c
exit.c do_wait() optimization: do not place sub-threads on task_struct->children list 2009-12-17 15:45:31 -08:00
extable.c
fork.c sched: Fix fork vs hotplug vs cpuset namespaces 2010-01-21 23:25:31 +01:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex_compat.c futex: Fix compat_futex to be same as futex for REQUEUE_PI 2009-08-10 15:41:12 +02:00
futex.c futex: Handle futex value corruption gracefully 2010-02-03 15:13:22 +01:00
groups.c groups: move code to kernel/groups.c 2009-06-16 19:47:48 -07:00
hrtimer.c hrtimers: Convert to raw_spinlocks 2009-12-14 23:55:34 +01:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger 2010-01-30 08:42:21 +01:00
itimer.c itimers: Fix racy writes to cpu_itimer fields 2009-11-18 16:32:12 +01:00
kallsyms.c hw-breakpoints: Fix broken hw-breakpoint sample module 2009-11-10 11:23:29 +01:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
kexec.c Merge git://git.infradead.org/~dwmw2/mtd-2.6.33 2010-01-24 10:31:34 -08:00
kfifo.c kfifo: fix kernel-doc notation 2010-02-02 18:11:21 -08:00
kgdb.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-04 16:07:41 -08:00
kmod.c kmod: fix resource leak in call_usermodehelper_pipe() 2010-01-11 09:34:04 -08:00
kprobes.c kprobes: Fix distinct type warning 2009-12-28 10:25:31 +01:00
ksysfs.c kexec: premit reduction of the reserved memory size 2009-12-16 07:20:13 -08:00
kthread.c sched: Move kthread_bind() back to kthread.c 2009-12-16 19:01:57 +01:00
latencytop.c
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
lockdep.c lockdep: Fix check_usage_backwards() error message 2010-01-27 08:34:02 +01:00
Makefile Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm 2009-12-08 08:02:38 -08:00
module.c modules: Skip empty sections when exporting section notes 2010-01-06 01:11:29 -08:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h locking: Implement new raw_spinlock 2009-12-14 23:55:32 +01:00
mutex.c mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
mutex.h
notifier.c kprobes: Fix to add __kprobes to notify_die 2009-08-30 03:08:26 +02:00
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c nsproxy: extract create_nsproxy() 2009-06-18 13:03:56 -07:00
panic.c kmsg_dump: Dump on crash_kexec as well 2009-12-31 19:45:04 +00:00
params.c tree-wide: convert open calls to remove spaces to skip_spaces() lib function 2009-12-15 08:53:32 -08:00
perf_event.c perf: Honour event state for aux stream data 2010-01-21 13:40:40 +01:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pid.c pid: reduce code size by using a pointer to iterate over array 2009-12-16 07:20:12 -08:00
pm_qos_params.c pm_qos: clean up racy global "name" variable 2009-10-14 15:31:10 +02:00
posix-cpu-timers.c posix-cpu-timers: optimize and document timer_create callback 2009-11-18 12:36:05 +01:00
posix-timers.c time: Introduce CLOCK_REALTIME_COARSE 2009-08-21 21:43:46 +02:00
printk.c Merge git://git.infradead.org/~dwmw2/mtd-2.6.33 2010-01-24 10:31:34 -08:00
profile.c kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file 2009-09-20 20:15:40 +02:00
ptrace.c ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee 2009-09-24 07:20:59 -07:00
rcupdate.c rcu: Re-arrange code to reduce #ifdef pain 2009-11-22 18:58:16 +01:00
rcutiny.c rcu: Eliminate unneeded function wrapping 2009-11-22 18:58:16 +01:00
rcutorture.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
rcutree_plugin.h rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree_trace.c rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree.c rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree.h rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
relay.c const: constify remaining pipe_buf_operations 2009-12-16 07:20:05 -08:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c resources: fix call to alignf() in allocate_resource() 2009-12-21 10:42:29 -08:00
rtmutex_common.h
rtmutex-debug.c sched: Convert pi_lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutes: Convert rtmutex.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex.h
rwsem.c
sched_clock.c sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2009-12-15 09:04:36 +01:00
sched_cpupri.c sched: Convert cpupri lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_cpupri.h sched: Convert cpupri lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_debug.c sched: Convert rq->lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_fair.c sched: Fix vmark regression on big machines 2010-01-21 13:39:03 +01:00
sched_features.h sched: Discard some old bits 2009-12-09 10:03:07 +01:00
sched_idletask.c sched: Restore printk sanity 2009-12-20 19:05:02 +01:00
sched_rt.c sched: Add pre and post wakeup hooks 2009-12-16 19:01:58 +01:00
sched_stats.h
sched.c sched: Fix SMT scheduler regression in find_busiest_queue() 2010-02-16 15:13:59 +01:00
seccomp.c
semaphore.c
signal.c kernel/signal.c: fix kernel information leak with print-fatal-signals=1 2010-01-11 09:34:05 -08:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 2009-12-08 07:38:50 -08:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c smp_call_function_any(): pass the node value to cpumask_of_node() 2010-01-16 12:15:39 -08:00
softirq.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
softlockup.c softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume 2010-02-01 08:22:32 +01:00
spinlock.c locking: Cleanup the name space completely 2009-12-14 23:55:33 +01:00
srcu.c rcu: Add synchronize_srcu_expedited() 2009-10-26 09:40:30 +01:00
stacktrace.c
stop_machine.c
sys_ni.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-12-08 07:55:01 -08:00
sys.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-19 09:47:34 -08:00
sysctl_binary.c SYSCTL: Print binary sysctl warnings (nearly) only once 2009-12-23 21:00:20 +01:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
sysctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 2009-12-17 16:58:26 -08:00
taskstats.c genetlink: make netns aware 2009-07-12 14:03:27 -07:00
test_kprobes.c
time.c Revert "time: Remove xtime_cache" 2009-12-22 14:10:37 -08:00
timeconst.pl
timer.c perf: Fix perf_event_do_pending() fallback callsite 2010-01-21 13:40:39 +01:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user_namespace.c
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c uids: Prevent tear down race 2009-11-02 16:02:39 +01:00
utsname_sysctl.c sysctl kernel: Remove binary sysctl logic 2009-11-12 02:04:55 -08:00
utsname.c utsns: extract creeate_uts_ns() 2009-06-18 13:03:55 -07:00
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2009-12-10 09:35:44 -08:00