linux/kernel
Tim Chen f90cc919f9 sched/balance: Skip unnecessary updates to idle load balancer's flags
We observed that the overhead on trigger_load_balance(), now renamed
sched_balance_trigger(), has risen with a system's core counts.

For an OLTP workload running 6.8 kernel on a 2 socket x86 systems
having 96 cores/socket, we saw that 0.7% cpu cycles are spent in
trigger_load_balance(). On older systems with fewer cores/socket, this
function's overhead was less than 0.1%.

The cause of this overhead was that there are multiple cpus calling
kick_ilb(flags), updating the balancing work needed to a common idle
load balancer cpu. The ilb_cpu's flags field got updated unconditionally
with atomic_fetch_or().  The atomic read and writes to ilb_cpu's flags
causes much cache bouncing and cpu cycles overhead. This is seen in the
annotated profile below.

             kick_ilb():
             if (ilb_cpu < 0)
               test   %r14d,%r14d
             ↑ js     6c
             flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
               mov    $0x2d600,%rdi
               movslq %r14d,%r8
               mov    %rdi,%rdx
               add    -0x7dd0c3e0(,%r8,8),%rdx
             arch_atomic_read():
  0.01         mov    0x64(%rdx),%esi
 35.58         add    $0x64,%rdx
             arch_atomic_fetch_or():

             static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
             {
             int val = arch_atomic_read(v);

             do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
  0.03  157:   mov    %r12d,%ecx
             arch_atomic_try_cmpxchg():
             return arch_try_cmpxchg(&v->counter, old, new);
  0.00         mov    %esi,%eax
             arch_atomic_fetch_or():
             do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
               or     %esi,%ecx
             arch_atomic_try_cmpxchg():
             return arch_try_cmpxchg(&v->counter, old, new);
  0.01         lock   cmpxchg %ecx,(%rdx)
 42.96       ↓ jne    2d2
             kick_ilb():

With instrumentation, we found that 81% of the updates do not result in
any change in the ilb_cpu's flags.  That is, multiple cpus are asking
the ilb_cpu to do the same things over and over again, before the ilb_cpu
has a chance to run NOHZ load balance.

Skip updates to ilb_cpu's flags if no new work needs to be done.
Such updates do not change ilb_cpu's NOHZ flags.  This requires an extra
atomic read but it is less expensive than frequent unnecessary atomic
updates that generate cache bounces.

We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
(or sched_balance_trigger()) got reduced from 0.7% to 0.2%.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240531205452.65781-1-tim.c.chen@linux.intel.com
2024-06-05 15:52:36 +02:00
..
bpf The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
cgroup Misc fixes: 2024-05-19 11:38:15 -07:00
configs hardening updates for 6.10-rc1 2024-05-13 14:14:05 -07:00
debug kdb: Simplify management of tmpbuffer in kdb_read() 2024-04-26 17:13:31 +01:00
dma dma-mapping updates for Linux 6.10 2024-05-20 10:23:39 -07:00
entry entry: Respect changes to system call number by trace_sys_enter() 2024-03-12 13:23:32 +01:00
events The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
futex printk: Change type of CONFIG_BASE_SMALL to bool 2024-05-06 17:39:09 +02:00
gcov gcov: annotate struct gcov_iterator with __counted_by 2023-10-18 14:43:22 -07:00
irq genirq/irqdesc: Prevent use-after-free in irq_find_at_or_after() 2024-05-24 12:49:35 +02:00
kcsan kcsan, compiler_types: Introduce __data_racy type qualifier 2024-05-07 11:39:50 -07:00
livepatch livepatch: Rename KLP_* to KLP_TRANSITION_* 2024-05-09 15:48:01 +02:00
locking - Core Frameworks 2024-05-22 10:49:54 -07:00
module Driver core changes for 6.10-rc1 2024-05-22 12:13:40 -07:00
power getting rid of bogus set_blocksize() uses, switching it 2024-05-21 08:34:51 -07:00
printk TTY/Serial changes for 6.10-rc1 2024-05-22 11:53:02 -07:00
rcu Merge branches 'fixes.2024.04.15a', 'misc.2024.04.12a', 'rcu-sync-normal-improve.2024.04.15a', 'rcu-tasks.2024.04.15a' and 'rcutorture.2024.04.15a' into rcu-merge.2024.04.15a 2024-05-01 13:04:02 +02:00
sched sched/balance: Skip unnecessary updates to idle load balancer's flags 2024-06-05 15:52:36 +02:00
time sysctl changes for v6.10-rc1 2024-05-17 17:31:24 -07:00
trace tracing: Minor last minute fixes 2024-05-23 12:36:38 -07:00
.gitignore
acct.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
async.c async: Use a dedicated unbound workqueue with raised min_active 2024-02-09 11:13:59 -10:00
audit_fsnotify.c
audit_tree.c fsnotify: create a wrapper fsnotify_find_inode_mark() 2024-04-04 16:24:16 +02:00
audit_watch.c fsnotify: create a wrapper fsnotify_find_inode_mark() 2024-04-04 16:24:16 +02:00
audit.c audit: use KMEM_CACHE() instead of kmem_cache_create() 2024-01-25 10:12:22 -05:00
audit.h audit: correct audit_filter_inodes() definition 2023-07-21 12:17:25 -04:00
auditfilter.c audit: remove unnecessary assignment in audit_dupe_lsm_field() 2024-01-25 09:59:27 -05:00
auditsc.c audit,io_uring: io_uring openat triggers audit reference count underflow 2023-10-13 18:34:46 +02:00
backtracetest.c backtracetest: Convert from tasklet to BH workqueue 2024-02-05 13:22:34 -10:00
bounds.c bounds: Use the right number of bits for power-of-two CONFIG_NR_CPUS 2024-04-29 08:29:29 -07:00
capability.c lsm: constify the 'target' parameter in security_capget() 2023-08-08 16:48:47 -04:00
cfi.c
compat.c
configs.c
context_tracking.c context_tracking: Make context_tracking_key __ro_after_init 2024-03-22 11:18:18 +01:00
cpu_pm.c
cpu.c cgroup: Changes for v6.10 2024-05-15 17:06:08 -07:00
crash_core.c Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
crash_reserve.c crash: add prefix for crash dumping messages 2024-05-08 08:41:26 -07:00
cred.c cred: Use KMEM_CACHE() instead of kmem_cache_create() 2024-02-23 17:33:31 -05:00
delayacct.c delayacct: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:54 +02:00
dma.c
elfcorehdr.c crash: remove dependency of FA_DUMP on CRASH_DUMP 2024-02-23 17:48:22 -08:00
exec_domain.c
exit.c virtio: features, fixes, cleanups 2024-05-23 12:04:36 -07:00
exit.h exit: add internal include file with helpers 2023-09-21 12:03:50 -06:00
extable.c
fail_function.c
fork.c mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio 2024-04-25 20:56:20 -07:00
freezer.c Linux 6.7-rc6 2023-12-23 15:52:13 +01:00
gen_kheaders.sh
groups.c groups: Convert group_info.usage to refcount_t 2023-09-29 11:28:39 -07:00
hung_task.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c
jump_label.c jump_label,module: Don't alloc static_key_mod for __ro_after_init keys 2024-03-22 11:18:16 +01:00
kallsyms_internal.h kallsyms: Avoid weak references for kallsyms symbols 2024-05-02 19:48:26 +09:00
kallsyms_selftest.c mm: vmalloc: enable memory allocation profiling 2024-04-25 20:55:57 -07:00
kallsyms_selftest.h
kallsyms.c kallsyms: Avoid weak references for kallsyms symbols 2024-05-02 19:48:26 +09:00
kcmp.c file: convert to SLAB_TYPESAFE_BY_RCU 2023-10-19 11:02:48 +02:00
Kconfig.freezer
Kconfig.hz
Kconfig.kexec crash: clean up kdump related config items 2024-02-23 17:48:22 -08:00
Kconfig.locks
Kconfig.preempt
kcov.c kcov: avoid clang out-of-range warning 2024-04-25 21:07:04 -07:00
kexec_core.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
kexec_elf.c
kexec_file.c crash: add a new kexec flag for hotplug support 2024-04-23 14:59:01 +10:00
kexec_internal.h crash: remove dependency of FA_DUMP on CRASH_DUMP 2024-02-23 17:48:22 -08:00
kexec.c crash: add a new kexec flag for hotplug support 2024-04-23 14:59:01 +10:00
kheaders.c
kprobes.c kprobe/ftrace: fix build error due to bad function definition 2024-05-17 19:17:55 -07:00
ksyms_common.c
ksysfs.c vmlinux: Avoid weak reference to notes section 2024-05-02 19:48:26 +09:00
kthread.c kunit: Handle test faults 2024-05-06 14:22:02 -06:00
latencytop.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
Makefile crash: split crash dumping code out from kexec_core.c 2024-02-23 17:48:22 -08:00
module_signature.c
notifier.c
nsproxy.c pidfd: add pidfs 2024-03-01 12:23:37 +01:00
numa.c kernel/numa.c: Move logging out of numa.h 2023-12-20 19:26:30 -05:00
padata.c padata: Disable BH when taking works lock on MT path 2024-04-12 15:07:51 +08:00
panic.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
params.c params: Fix multi-line comment style 2023-12-01 09:51:44 -08:00
pid_namespace.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
pid_sysctl.h kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
pid.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
profile.c profiling: Remove create_prof_cpu_mask(). 2024-04-27 11:17:48 -07:00
ptrace.c ptrace_attach: shift send(SIGSTOP) into ptrace_set_stopped() 2024-02-22 15:38:52 -08:00
range.c
reboot.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
regset.c regset: use kvzalloc() for regset_get_alloc() 2024-04-25 21:07:03 -07:00
relay.c kernel: relay: remove relay_file_splice_read dead code, doesn't work 2023-12-29 12:22:27 -08:00
resource_kunit.c
resource.c Quite a lot of kexec work this time around. Many singleton patches in 2024-01-09 11:46:20 -08:00
rseq.c
scftorture.c scftorture: Pause testing after memory-allocation failure 2023-07-14 15:02:57 -07:00
scs.c
seccomp.c sysctl changes for v6.10-rc1 2024-05-17 17:31:24 -07:00
signal.c virtio: features, fixes, cleanups 2024-05-23 12:04:36 -07:00
smp.c CSD lock commits for v6.7 2023-10-30 17:56:53 -10:00
smpboot.c kthread: add kthread_stop_put 2023-10-04 10:41:57 -07:00
smpboot.h
softirq.c softirq: Fix suspicious RCU usage in __do_softirq() 2024-04-29 05:03:51 +02:00
stackleak.c sysctl changes for v6.10-rc1 2024-05-17 17:31:24 -07:00
stacktrace.c stacktrace: fix kernel-doc typo 2023-12-29 12:22:29 -08:00
static_call_inline.c
static_call.c
stop_machine.c
sys_ni.c mseal: wire up mseal syscall 2024-05-23 19:40:26 -07:00
sys.c RISC-V Patches for the 6.10 Merge Window, Part 1 2024-05-22 09:56:00 -07:00
sysctl-test.c
sysctl.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
task_work.c task_work: add kerneldoc annotation for 'data' argument 2023-09-19 13:21:32 -07:00
taskstats.c taskstats: fill_stats_for_tgid: use for_each_thread() 2023-10-04 10:41:57 -07:00
torture.c torture: Print out torture module parameters 2023-09-24 17:24:01 +02:00
tracepoint.c
tsacct.c
ucount.c sysctl changes for v6.10-rc1 2024-05-17 17:31:24 -07:00
uid16.c
uid16.h
umh.c umh: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
up.c smp: Change function signatures to use call_single_data_t 2023-09-13 14:59:24 +02:00
user_namespace.c user_namespace: remove unnecessary NULL values from kbuf 2024-02-22 15:38:52 -08:00
user-return-notifier.c
user.c printk: Change type of CONFIG_BASE_SMALL to bool 2024-05-06 17:39:09 +02:00
usermode_driver.c
utsname_sysctl.c kernel misc: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:53 +02:00
utsname.c
vhost_task.c vhost_task: Handle SIGKILL by flushing work and exiting 2024-05-22 08:31:15 -04:00
vmcore_info.c mm: free up PG_slab 2024-04-25 20:56:00 -07:00
watch_queue.c watch_queue: fix kcalloc() arguments order 2023-12-21 13:17:54 +01:00
watchdog_buddy.c watchdog/hardlockup: move SMP barriers from common code to buddy code 2023-06-19 16:25:28 -07:00
watchdog_perf.c kernel/watchdog_perf.c: tidy up kerneldoc 2024-05-08 08:41:29 -07:00
watchdog.c Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
workqueue_internal.h workqueue: Drop the special locking rule for worker->flags and worker_pool->flags 2023-08-07 15:57:22 -10:00
workqueue.c Merge branch 'for-6.10' into test-merge-for-6.10 2024-05-15 11:40:33 -10:00