linux/arch/x86/lib
Ma Ling 59daa706fb x86, mem: Optimize memcpy by avoiding memory false dependence
All read operations after the allocation stage can run speculatively,
while all write operations retire in program order; a read may run
ahead of an older write if their addresses differ, otherwise it must
wait until the write commits. However, the CPU does not check every
address bit, so it can fail to recognize that two addresses differ
even when they lie in different pages. For example, if %rsi is 0xf004
and %rdi is 0xe008, the following sequence incurs a large latency:
1. movq (%rsi),	%rax
2. movq %rax,	(%rdi)
3. movq 8(%rsi), %rax
4. movq %rax,	8(%rdi)

If %rsi and %rdi really were in the same memory page, there would be a
true read-after-write dependence: instruction 2 writes 0x008..0x00f
and instruction 3 reads 0x00c, so the two accesses partially overlap.
In fact they are in different pages and there is no real hazard, but
without checking every address bit the CPU may assume they share a
page. Instruction 3 then has to wait for instruction 2 to drain its
data from the write buffer into the cache before it can load, and the
read ends up costing about as much as an mfence instruction. We can
avoid this by reordering the operations as follows:

1. movq 8(%rsi), %rax
2. movq %rax,	8(%rdi)
3. movq (%rsi),	%rax
4. movq %rax,	(%rdi)

Instruction 3 now reads 0x004 while instruction 2 writes 0x010, so
there is no apparent dependence. With this change we gained a 1.83x
speedup over the original instruction sequence on Core2. The patch
first handles small sizes (less than 20 bytes) and then jumps to the
appropriate copy mode. In our micro-benchmark we measured up to 2x
improvement for sizes from 1 to 127 bytes, and up to 1.5x for 1024
bytes, on Core i7. (These numbers come from our own micro-benchmark;
we will run further tests as required.)

Signed-off-by: Ma Ling <ling.ma@intel.com>
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-08-23 14:56:41 -07:00
.gitignore x86: Gitignore: arch/x86/lib/inat-tables.c 2009-11-04 13:11:28 +01:00
atomic64_32.c x86-32: Rewrite 32-bit atomic64 functions in assembly 2010-02-25 20:47:30 -08:00
atomic64_386_32.S x86, asm: Use a lower case name for the end macro in atomic64_386_32.S 2010-08-12 07:04:16 -07:00
atomic64_cx8_32.S x86-32: Fix atomic64_inc_not_zero return value convention 2010-03-01 11:39:03 -08:00
cache-smp.c x86, lib: Add wbinvd smp helpers 2010-01-22 16:05:42 -08:00
checksum_32.S
clear_page_64.S x86, alternatives: Use 16-bit numbers for cpufeature index 2010-07-07 10:36:28 -07:00
cmpxchg8b_emu.S x86: Provide an alternative() based cmpxchg64() 2009-09-30 22:55:59 +02:00
cmpxchg.c x86, asm: Merge cmpxchg_486_u64() and cmpxchg8b_emu() 2010-07-28 17:05:11 -07:00
copy_page_64.S x86, alternatives: Use 16-bit numbers for cpufeature index 2010-07-07 10:36:28 -07:00
copy_user_64.S x86, alternatives: Fix one more open-coded 8-bit alternative number 2010-07-13 14:56:16 -07:00
copy_user_nocache_64.S x86: wrong register was used in align macro 2008-07-30 10:10:39 -07:00
csum-copy_64.S
csum-partial_64.c x86: fix csum_partial() export 2008-05-13 19:38:47 +02:00
csum-wrappers_64.c
delay.c x86, delay: tsc based udelay should have rdtsc_barrier 2009-06-25 16:47:40 -07:00
getuser.S x86: use _types.h headers in asm where available 2009-02-13 11:35:01 -08:00
inat.c x86: AVX instruction set decoder support 2009-10-29 08:47:46 +01:00
insn.c x86: AVX instruction set decoder support 2009-10-29 08:47:46 +01:00
iomap_copy_64.S
Makefile x86, asm: Move cmpxchg emulation code to arch/x86/lib 2010-07-28 16:53:49 -07:00
memcpy_32.c x86, mem: Optimize memcpy by avoiding memory false dependence 2010-08-23 14:56:41 -07:00
memcpy_64.S x86, mem: Optimize memcpy by avoiding memory false dependence 2010-08-23 14:56:41 -07:00
memmove_64.c x86, mem: Don't implement forward memmove() as memcpy() 2010-08-23 14:14:27 -07:00
memset_64.S x86, alternatives: Use 16-bit numbers for cpufeature index 2010-07-07 10:36:28 -07:00
mmx_32.c x86: clean up mmx_32.c 2008-04-17 17:40:47 +02:00
msr-reg-export.c x86, msr: change msr-reg.o to obj-y, and export its symbols 2009-09-04 10:00:09 -07:00
msr-reg.S x86, msr: Fix msr-reg.S compilation with gas 2.16.1, on 32-bit too 2009-09-03 21:26:34 +02:00
msr-smp.c x86, msr: msrs_alloc/free for CONFIG_SMP=n 2009-12-16 15:36:32 -08:00
msr.c x86, msr: msrs_alloc/free for CONFIG_SMP=n 2009-12-16 15:36:32 -08:00
putuser.S x86: merge putuser asm functions. 2008-07-09 09:14:13 +02:00
rwlock_64.S
rwsem_64.S Fix the x86_64 implementation of call_rwsem_wait() 2010-05-04 15:24:14 -07:00
semaphore_32.S
string_32.c x86: coding style fixes to arch/x86/lib/string_32.c 2008-08-15 16:53:25 +02:00
strstr_32.c x86: coding style fixes to arch/x86/lib/strstr_32.c 2008-08-15 16:53:24 +02:00
thunk_32.S ftrace: trace irq disabled critical timings 2008-05-23 20:32:46 +02:00
thunk_64.S ftrace: trace irq disabled critical timings 2008-05-23 20:32:46 +02:00
usercopy_32.c x86: Turn the copy_from_user check into an (optional) compile time warning 2009-10-01 11:31:04 +02:00
usercopy_64.c x86, 64-bit: Clean up user address masking 2009-06-20 15:40:00 -07:00
x86-opcode-map.txt x86: Add Intel FMA instructions to x86 opcode map 2009-10-29 08:47:47 +01:00