linux

History

Nicolas Pitre 73e592f3bc ARM: 8504/1: __arch_xprod_64(): small optimization The tmp variable is used twice: first to pose as a register containing a value of zero, and then to provide a temporary register that initially is zero and get added some value. But somehow gcc decides to split those two usages in different registers. Example code: u64 div64const1000(u64 x) { u32 y = 1000; do_div(x, y); return x; } Result: div64const1000: push {r4, r5, r6, r7, lr} mov lr, #0 mov r6, r0 mov r7, r1 adr r5, .L8 ldrd r4, [r5] mov r1, lr umull r2, r3, r4, r6 cmn r2, r4 adcs r3, r3, r5 adc r2, lr, #0 umlal r3, r2, r5, r6 umlal r3, r1, r4, r7 mov r3, #0 adds r2, r1, r2 adc r3, r3, #0 umlal r2, r3, r5, r7 lsr r0, r2, #9 lsr r1, r3, #9 orr r0, r0, r3, lsl #23 pop {r4, r5, r6, r7, pc} .align 3 .L8: .word -1924145349 .word -2095944041 Full kernel build size: text data bss dec hex filename 13663814 1553940 351368 15569122 ed90e2 vmlinux Here the two instances of 'tmp' are assigned to r1 and lr. To avoid that, let's mark the first 'tmp' usage in __arch_xprod_64() with a "+r" constraint even if the register is not written to, so to create a dependency for the second usage with the effect of enforcing a single temporary register throughout. Result: div64const1000: push {r4, r5, r6, r7} movs r3, #0 adr r5, .L8 ldrd r4, [r5] umull r6, r7, r4, r0 cmn r6, r4 adcs r7, r7, r5 adc r6, r3, #0 umlal r7, r6, r5, r0 umlal r7, r3, r4, r1 mov r7, #0 adds r6, r3, r6 adc r7, r7, #0 umlal r6, r7, r5, r1 lsr r0, r6, #9 lsr r1, r7, #9 orr r0, r0, r7, lsl #23 pop {r4, r5, r6, r7} bx lr .align 3 .L8: .word -1924145349 .word -2095944041 text data bss dec hex filename 13663438 1553940 351368 15568746 ed8f6a vmlinux This time 'tmp' is assigned to r3 and used throughout. However, by being assigned to r3, that blocks usage of the r2-r3 double register slot for 64-bit values, forcing more registers to be spilled on the stack. Let's try to help it by forcing 'tmp' to the caller-saved ip register. Result: div64const1000: stmfd sp!, {r4, r5} mov ip, #0 adr r5, .L8 ldrd r4, [r5] umull r2, r3, r4, r0 cmn r2, r4 adcs r3, r3, r5 adc r2, ip, #0 umlal r3, r2, r5, r0 umlal r3, ip, r4, r1 mov r3, #0 adds r2, ip, r2 adc r3, r3, #0 umlal r2, r3, r5, r1 mov r0, r2, lsr #9 mov r1, r3, lsr #9 orr r0, r0, r3, asl #23 ldmfd sp!, {r4, r5} bx lr .align 3 .L8: .word -1924145349 .word -2095944041 text data bss dec hex filename 13662838 1553940 351368 15568146 ed8d12 vmlinux We could make the code marginally smaller yet by forcing 'tmp' to lr instead, but that would have a negative inpact on branch prediction for which "bx lr" is optimal. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>		2016-02-11 15:33:37 +00:00
..
alpha	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
arc	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
arm	ARM: 8504/1: __arch_xprod_64(): small optimization	2016-02-11 15:33:37 +00:00
arm64	ARM: SoC support for Tegra platforms for v4.5	2016-01-22 17:30:52 -08:00
avr32	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
blackfin	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
c6x	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
cris	Merge branch 'akpm' (patches from Andrew)	2016-01-21 12:32:08 -08:00
frv	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
h8300	Merge branch 'akpm' (patches from Andrew)	2016-01-21 12:32:08 -08:00
hexagon	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
ia64	[IA64] Enable copy_file_range syscall for ia64	2016-01-22 14:20:01 -08:00
m32r	m32r: fix m32104ut_defconfig build fail	2016-01-14 16:00:49 -08:00
m68k	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
metag	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
microblaze	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
mips	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus	2016-01-24 12:50:56 -08:00
mn10300	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
nios2	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
openrisc	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
parisc	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
powerpc	wrappers for ->i_mutex access	2016-01-22 18:04:28 -05:00
s390	wrappers for ->i_mutex access	2016-01-22 18:04:28 -05:00
score
sh	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
sparc	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
tile	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
um	um: kill pfn_t	2016-01-15 17:56:32 -08:00
unicore32	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00
x86	pmem: add wb_cache_pmem() to the PMEM API	2016-01-22 17:02:18 -08:00
xtensa	dma-mapping: remove <asm-generic/dma-coherent.h>	2016-01-20 17:09:18 -08:00
.gitignore
Kconfig	dma-mapping: always provide the dma_map_ops based implementation	2016-01-20 17:09:18 -08:00