linux/include
Shaohua Li 9a46ad6d6d smp: make smp_call_function_many() use logic similar to smp_call_function_single()
I'm testing swapout workload in a two-socket Xeon machine.  The workload
has 10 threads, each thread sequentially accesses separate memory
region.  TLB flush overhead is very big in the workload.  For each page,
page reclaim need move it from active lru list and then unmap it.  Both
need a TLB flush.  And this is a multthread workload, TLB flush happens
in 10 CPUs.  In X86, TLB flush uses generic smp_call)function.  So this
workload stress smp_call_function_many heavily.

Without patch, perf shows:
+  24.49%  [k] generic_smp_call_function_interrupt
-  21.72%  [k] _raw_spin_lock
   - _raw_spin_lock
      + 79.80% __page_check_address
      + 6.42% generic_smp_call_function_interrupt
      + 3.31% get_swap_page
      + 2.37% free_pcppages_bulk
      + 1.75% handle_pte_fault
      + 1.54% put_super
      + 1.41% grab_super_passive
      + 1.36% __swap_duplicate
      + 0.68% blk_flush_plug_list
      + 0.62% swap_info_get
+   6.55%  [k] flush_tlb_func
+   6.46%  [k] smp_call_function_many
+   5.09%  [k] call_function_interrupt
+   4.75%  [k] default_send_IPI_mask_sequence_phys
+   2.18%  [k] find_next_bit

swapout throughput is around 1300M/s.

With the patch, perf shows:
-  27.23%  [k] _raw_spin_lock
   - _raw_spin_lock
      + 80.53% __page_check_address
      + 8.39% generic_smp_call_function_single_interrupt
      + 2.44% get_swap_page
      + 1.76% free_pcppages_bulk
      + 1.40% handle_pte_fault
      + 1.15% __swap_duplicate
      + 1.05% put_super
      + 0.98% grab_super_passive
      + 0.86% blk_flush_plug_list
      + 0.57% swap_info_get
+   8.25%  [k] default_send_IPI_mask_sequence_phys
+   7.55%  [k] call_function_interrupt
+   7.47%  [k] smp_call_function_many
+   7.25%  [k] flush_tlb_func
+   3.81%  [k] _raw_spin_lock_irqsave
+   3.78%  [k] generic_smp_call_function_single_interrupt

swapout throughput is around 1400M/s.  So there is around a 7%
improvement, and total cpu utilization doesn't change.

Without the patch, cfd_data is shared by all CPUs.
generic_smp_call_function_interrupt does read/write cfd_data several times
which will create a lot of cache ping-pong.  With the patch, the data
becomes per-cpu.  The ping-pong is avoided.  And from the perf data, this
doesn't make call_single_queue lock contend.

Next step is to remove generic_smp_call_function_interrupt() from arch
code.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:20 -08:00
..
acpi Merge branch 'acpi-cleanup' 2013-02-15 13:58:30 +01:00
asm-generic The common clock framework changes for 3.9 are almost entirely fixes. 2013-02-20 11:02:10 -08:00
clocksource
crypto
drm Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel 2013-01-11 07:52:48 +10:00
keys
linux smp: make smp_call_function_many() use logic similar to smp_call_function_single() 2013-02-21 17:22:20 -08:00
math-emu
media
memory
misc
net ipv6: fix race condition regarding dst->expires and dst->from. 2013-02-20 15:11:45 -05:00
pcmcia
ras
rdma UAPI: Remove empty Kbuild files 2013-01-02 17:36:10 -08:00
rxrpc
scsi SCSI misc on 20121212 2012-12-13 19:20:31 -08:00
sound Merge remote-tracking branch 'asoc/fix/cs4271' into tmp 2013-01-10 12:22:11 +00:00
target target: Introduce TCM_NO_SENSE 2013-01-10 20:06:08 -08:00
trace ACPI and power management updates for 3.9-rc1 2013-02-20 11:26:56 -08:00
uapi block: optionally snapshot page contents to provide stable pages during write 2013-02-21 17:22:20 -08:00
video video: s3c-fb: fix typo in definition of VIDCON1_VSTATUS_FRONTPORCH value 2013-02-21 17:22:18 -08:00
xen Bugfixes: 2012-12-18 12:26:54 -08:00
Kbuild UAPI: Remove empty Kbuild files 2013-01-02 17:36:10 -08:00