linux/kernel
Frederic Weisbecker d9a3c9823a sched: Lower chances of cputime scaling overflow
Some users have reported that after running a process with
hundreds of threads on intensive CPU-bound loads, the cputime
of the group started to freeze after a few days.

This is due to how we scale the tick-based cputime against
the scheduler precise execution time value.

We add the values of all threads in the group and we multiply
that against the sum of the scheduler exec runtime of the whole
group.

This easily overflows after a few days/weeks of execution.

A proposed solution to solve this was to compute that multiplication
on stime instead of utime:
   62188451f0
   ("cputime: Avoid multiplication overflow on utime scaling")

The rationale behind that was that it's easy for a thread to
spend most of its time in userspace under intensive CPU-bound workload
but it's much harder to do CPU-bound intensive long run in the kernel.

This postulate got defeated when a user recently reported he was still
seeing cputime freezes after the above patch. The workload that
triggers this issue relates to intensive networking workloads where
most of the cputime is consumed in the kernel.

To reduce much more the opportunities for multiplication overflow,
lets reduce the multiplication factors to the remainders of the division
between sched exec runtime and cputime. Assuming the difference between
these shouldn't ever be that large, it could work on many situations.

This gets the same results as in the upstream scaling code except for
a small difference: the upstream code always rounds the results to
the nearest integer not greater to what would be the precise result.
The new code rounds to the nearest integer either greater or not
greater. In practice this difference probably shouldn't matter but
it's worth mentioning.

If this solution appears not to be enough in the end, we'll
need to partly revert back to the behaviour prior to commit
     0cf55e1ec0
     ("sched, cputime: Introduce thread_group_times()")

Back then, the scaling was done on exit() time before adding the cputime
of an exiting thread to the signal struct. And then we'll need to
scale one-by-one the live threads cputime in thread_group_cputime(). The
drawback may be a slightly slower code on exit time.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2013-03-13 18:18:14 +01:00
..
debug KGDB/KDB fixes and cleanups 2013-03-02 08:31:39 -08:00
events hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
gcov kernel/gcov: remove depends on CONFIG_EXPERIMENTAL 2013-01-11 11:39:33 -08:00
irq Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
power Driver core patches for 3.9-rc1 2013-02-21 12:05:51 -08:00
sched sched: Lower chances of cputime scaling overflow 2013-03-13 18:18:14 +01:00
time Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2013-02-28 19:48:26 -08:00
trace ImgTec Meta architecture changes for v3.9-rc1 2013-03-03 12:06:09 -08:00
.gitignore
acct.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
async.c Merge branch 'for-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2013-02-19 22:01:33 -08:00
audit_tree.c audit: catch possible NULL audit buffers 2013-01-11 14:54:55 -08:00
audit_watch.c audit: catch possible NULL audit buffers 2013-01-11 14:54:55 -08:00
audit.c kernel/audit.c: avoid negative sleep durations 2013-01-11 14:54:56 -08:00
audit.h audit: optimize audit_compare_dname_path 2012-10-12 00:32:02 -04:00
auditfilter.c audit: fix auditfilter.c kernel-doc warnings 2013-01-10 14:35:23 -08:00
auditsc.c audit: catch possible NULL audit buffers 2013-01-11 14:54:55 -08:00
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() 2012-11-19 08:13:38 -08:00
cgroup.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-02-23 18:50:11 -08:00
configs.c
context_tracking.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 18:19:48 -08:00
cpu_pm.c
cpu.c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 19:04:55 -08:00
cpuset.c Merge branch 'for-3.9-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2013-02-20 09:18:31 -08:00
crash_dump.c
cred.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-12-18 10:55:28 -08:00
delayacct.c cputime: Use accessors to read task cputime stats 2013-01-27 19:23:31 +01:00
dma.c
elfcore.c
exec_domain.c
exit.c coredump: use a freezable_schedule for the coredump_finish wait 2013-02-27 19:10:11 -08:00
extable.c
fork.c cputime: Dynamically scale cputime for full dynticks accounting 2013-03-07 17:10:32 +01:00
freezer.c freezer: change ptrace_stop/do_signal_stop to use freezable_schedule() 2012-10-26 14:27:49 -07:00
futex_compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-02-23 18:50:11 -08:00
futex.c more file_inode() open-coded instances 2013-02-27 16:59:05 -05:00
groups.c
hrtimer.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 19:05:45 -08:00
hung_task.c
irq_work.c Merge branch 'nohz/printk-v8' into irq/core 2013-02-05 00:48:46 +01:00
itimer.c
jump_label.c jump_label: Export jump_label_rate_limit() 2012-08-06 19:00:35 +03:00
kallsyms.c
kcmp.c kcmp: include linux/ptrace.h 2012-12-20 17:40:19 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking: Adjust spin lock inlining Kconfig options 2012-09-13 17:56:13 +02:00
Kconfig.preempt
kexec.c kexec: avoid freeing NULL pointer in image_crash_alloc() 2013-02-27 19:10:12 -08:00
kmod.c Merge branch 'master' into for-3.9-async 2013-01-23 09:31:01 -08:00
kprobes.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
ksysfs.c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-12-11 18:10:49 -08:00
kthread.c kthread: use N_MEMORY instead N_HIGH_MEMORY 2012-12-12 17:38:33 -08:00
latencytop.c
lglock.c
lockdep_internals.h
lockdep_proc.c lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name() 2012-10-24 12:39:09 +02:00
lockdep_states.h
lockdep.c lockdep: check that no locks held at freeze time 2013-02-27 19:10:11 -08:00
Makefile Merge branch 'akpm' (final batch from Andrew) 2013-02-27 20:58:09 -08:00
modsign_certificate.S MODSIGN: Avoid using .incbin in C source 2012-12-14 13:06:44 +10:30
modsign_pubkey.c keys: use keyring_alloc() to create module signing keyring 2012-12-20 17:40:21 -08:00
module_signing.c MODSIGN: Don't use enum-type bitfields in module signature info block 2012-12-05 11:27:24 +10:30
module-internal.h MODSIGN: Move the magic string to the end of a module and eliminate the search 2012-10-19 17:30:40 -07:00
module.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
mutex-debug.c
mutex-debug.h
mutex.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
mutex.h
notifier.c
nsproxy.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
padata.c padata: use __this_cpu_read per-cpu helper 2012-12-06 17:16:23 +08:00
panic.c taint: add explicit flag to show whether lock dep is still OK. 2013-01-21 17:17:57 +10:30
params.c
pid_namespace.c pidns: Stop pid allocation when init dies 2012-12-25 16:10:05 -08:00
pid.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
posix-cpu-timers.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 19:05:45 -08:00
posix-timers.c posix-timers: convert to idr_alloc() 2013-02-27 19:10:19 -08:00
printk.c Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2013-02-25 16:46:44 -08:00
profile.c profiling: Remove unused timer hook 2013-01-24 15:37:26 +01:00
ptrace.c uprobes: Add exports for module use 2013-02-08 17:47:13 +01:00
range.c
rcu.h rcu: Provide RCU CPU stall warnings for tiny RCU 2013-01-28 22:06:21 -08:00
rcupdate.c Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
rcutiny_plugin.h rcu: Provide RCU CPU stall warnings for tiny RCU 2013-01-28 22:06:21 -08:00
rcutiny.c Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
rcutorture.c rcu: Allow rcutorture to be built at low optimization levels 2013-02-04 12:18:20 -08:00
rcutree_plugin.h rcu: Make rcu_nocb_poll an early_param instead of module_param 2013-01-08 14:12:19 -08:00
rcutree_trace.c rcu: Separate accounting of callbacks from callback-free CPUs 2012-11-16 10:05:57 -08:00
rcutree.c Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
rcutree.h Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD 2013-01-28 22:25:21 -08:00
relay.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
res_counter.c res_counter: return amount of charges after res_counter_uncharge() 2012-12-18 15:02:12 -08:00
resource.c kernel/resource.c: fix stack overflow in __reserve_region_with_split() 2012-10-06 03:05:31 +09:00
rtmutex_common.h
rtmutex-debug.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
rtmutex-debug.h
rtmutex-tester.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
rtmutex.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
rtmutex.h
rwsem.c lockdep, rwsem: provide down_write_nest_lock() 2013-01-11 14:54:55 -08:00
seccomp.c seccomp: Make syscall skipping and nr changes more consistent 2012-10-02 21:14:29 +10:00
semaphore.c
signal.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-03-02 19:32:06 -08:00
smp.c smp: make smp_call_function_many() use logic similar to smp_call_function_single() 2013-02-21 17:22:20 -08:00
smpboot.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
smpboot.h smpboot: Provide infrastructure for percpu hotplug threads 2012-08-13 17:01:07 +02:00
softirq.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-02-20 18:58:50 -08:00
spinlock.c
srcu.c srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock() 2013-02-07 15:19:36 -08:00
stacktrace.c
stop_machine.c stop_machine: Use smpboot threads 2013-02-14 15:29:38 +01:00
sys_ni.c module: add syscall to load module from fd 2012-12-14 13:05:22 +10:30
sys.c usermodehelper: cleanup/fix __orderly_poweroff() && argv_free() 2013-02-27 19:10:09 -08:00
sysctl_binary.c sysctl: fix null checking in bin_dn_node_address() 2013-02-27 19:10:21 -08:00
sysctl.c Initial ARC Linux port with some fixes on top for 3.9-rc1 2013-03-02 07:58:56 -08:00
task_work.c task_work: task_work_add() should not succeed after exit_task_work() 2012-09-13 16:47:34 +02:00
taskstats.c taskstats: cgroupstats_user_cmd() may leak on error 2012-10-06 03:05:31 +09:00
test_kprobes.c
time.c time: don't inline EXPORT_SYMBOL functions 2013-02-21 17:22:19 -08:00
timeconst.bc kernel: Replace timeconst.pl with a bc script 2013-02-16 23:17:25 +01:00
timer.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 18:19:48 -08:00
tracepoint.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
tsacct.c cputime: Use accessors to read task cputime stats 2013-01-27 19:23:31 +01:00
uid16.c
up.c
user_namespace.c userns: Allow any uid or gid mappings that don't overlap. 2013-01-26 22:12:04 -08:00
user-return-notifier.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
user.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
utsname_sysctl.c kernel/utsname_sysctl.c: put get/get_uts() into CONFIG_PROC_SYSCTL code block 2013-02-27 19:10:22 -08:00
utsname.c kernel/utsname.c: fix wrong comment about clone_uts_ns() 2013-02-27 19:10:22 -08:00
wait.c propagate name change to comments in kernel source 2012-12-06 10:39:54 +01:00
watchdog.c Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-22 19:25:09 -08:00
workqueue_internal.h workqueue: rename cpu_workqueue to pool_workqueue 2013-02-13 19:29:12 -08:00
workqueue.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00