e799de3efb
5 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
0ade34c370 |
lib: optimize cpumask_next_and()
We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and(). It's essentially a joined iteration in search for a non-zero bit, which is currently implemented as a lookup join (find a nonzero bit on the lhs, lookup the rhs to see if it's set there). Implement a direct join (find a nonzero bit on the incrementally built join). Also add generic bitmap benchmarks in the new `test_find_bit` module for new function (see `find_next_and_bit` in [2] and [3] below). For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x faster with a geometric mean of 2.1 on 32 CPUs [1]. No impact on memory usage. Note that on Arm, the new pure-C implementation still outperforms the old one that uses a mix of C and asm (`find_next_bit`) [3]. [1] Approximate benchmark code: ``` unsigned long src1p[nr_cpumask_longs] = {pattern1}; unsigned long src2p[nr_cpumask_longs] = {pattern2}; for (/*a bunch of repetitions*/) { for (int n = -1; n <= nr_cpu_ids; ++n) { asm volatile("" : "+rm"(src1p)); // prevent any optimization asm volatile("" : "+rm"(src2p)); unsigned long result = cpumask_next_and(n, src1p, src2p); asm volatile("" : "+rm"(result)); } } ``` Results: pattern1 pattern2 time_before/time_after 0x0000ffff 0x0000ffff 1.65 0x0000ffff 0x00005555 2.24 0x0000ffff 0x00001111 2.94 0x0000ffff 0x00000000 14.0 0x00005555 0x0000ffff 1.67 0x00005555 0x00005555 1.71 0x00005555 0x00001111 1.90 0x00005555 0x00000000 6.58 0x00001111 0x0000ffff 1.46 0x00001111 0x00005555 1.49 0x00001111 0x00001111 1.45 0x00001111 0x00000000 3.10 0x00000000 0x0000ffff 1.18 0x00000000 0x00005555 1.18 0x00000000 0x00001111 1.17 0x00000000 0x00000000 1.25 ----------------------------- geo.mean 2.06 [2] test_find_next_bit, X86 (skylake) [ 3913.477422] Start testing find_bit() with random-filled bitmap [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations [ 3913.480216] Start testing find_next_and_bit() with random-filled bitmap [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations [ 3913.481075] Start testing find_bit() with sparse bitmap [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations [3] test_find_next_bit, arm (v7 odroid XU3). [ 267.206928] Start testing find_bit() with random-filled bitmap [ 267.214752] find_next_bit: 4474 cycles, 16419 iterations [ 267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations [ 267.229294] find_last_bit: 4209 cycles, 16419 iterations [ 267.279131] find_first_bit: 1032991 cycles, 16420 iterations [ 267.286265] Start testing find_next_and_bit() with random-filled bitmap [ 267.302386] find_next_and_bit: 2290 cycles, 8140 iterations [ 267.309422] Start testing find_bit() with sparse bitmap [ 267.316054] find_next_bit: 191 cycles, 66 iterations [ 267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations [ 267.329803] find_last_bit: 84 cycles, 66 iterations [ 267.336169] find_first_bit: 4118 cycles, 66 iterations [ 267.342627] Start testing find_next_and_bit() with sparse bitmap [ 267.356919] find_next_and_bit: 91 cycles, 1 iterations [courbet@google.com: v6] Link: http://lkml.kernel.org/r/20171129095715.23430-1-courbet@google.com [geert@linux-m68k.org: m68k/bitops: always include <asm-generic/bitops/find.h>] Link: http://lkml.kernel.org/r/1512556816-28627-1-git-send-email-geert@linux-m68k.org Link: http://lkml.kernel.org/r/20171128131334.23491-1-courbet@google.com Signed-off-by: Clement Courbet <courbet@google.com> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Yury Norov <ynorov@caviumnetworks.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
066def56dc |
m68k/bitops: Correct signature of test_bit()
mm/filemap.c: In function ‘clear_bit_unlock_is_negative_byte’: mm/filemap.c:933: warning: passing argument 2 of ‘test_bit’ discards qualifiers from pointer target type Make the bitmask pointed to by the "vaddr" parameter volatile to fix this, like is done on other architectures. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> |
||
|
2db56e8606 |
arch,m68k: Convert smp_mb__*()
m68k uses asm-generic/barrier.h and its smp_mb() is barrier(), therefore we can use the generic versions that use smp_mb(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-s5dvosrb7qhvpmtaffwfn0zg@git.kernel.org Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
171d809df1 |
m68k: merge mmu and non-mmu bitops.h
The following patch merges the mmu and non-mmu versions of the m68k bitops.h files. Now there is a good deal of difference between the two files, but none of it is actually an mmu specific difference. It is all about the specific m68k/coldfire varient we are targeting. So it makes an awful lot of sense to merge these into a single bitops.h. There is a number of ways I can see to factor this code. The approach I have taken here is to keep the various versions of each macro/function type together. This means that there is some ifdefery with each to handle each CPU type. I have added some comments in a couple of appropriate places to try and make it clear what the differences we are dealing with are. Specifically the instruction and addressing mode differences we have to deal with. The merged form keeps the same underlying optimizations for each CPU type for all the general bit clear/set/change and find bit operations. It does switch to using the generic le operations though, instead of any local varients. Build tested on ColdFire, 68328, 68360 (which is cpu32) and 68020+. Run tested on ColdFire and ARAnyM. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> |
||
|
49148020bc |
m68k,m68knommu: merge header files
Merge header files for m68k and m68knommu to the single location: arch/m68k/include/asm The majority of this patch was the result of the script that is included in the changelog below. The script was originally written by Arnd Bergman and exten by me to cover a few more files. When the header files differed the script uses the following: The original m68k file is named <file>_mm.h [mm for memory manager] The m68knommu file is named <file>_no.h [no for no memory manager] The files uses the following include guard: This include gaurd works as the m68knommu toolchain set the __uClinux__ symbol - so this should work in userspace too. Merging the header files for m68k and m68knommu exposes the (unexpected?) ABI differences thus it is easier to actually identify these and thus to fix them. The commit has been build tested with both a m68k and a m68knommu toolchain - with success. The commit has also been tested with "make headers_check" and this patch fixes make headers_check for m68knommu. The script used: TARGET=arch/m68k/include/asm SOURCE=arch/m68knommu/include/asm INCLUDE="cachectl.h errno.h fcntl.h hwtest.h ioctls.h ipcbuf.h \ linkage.h math-emu.h md.h mman.h movs.h msgbuf.h openprom.h \ oplib.h poll.h posix_types.h resource.h rtc.h sembuf.h shmbuf.h \ shm.h shmparam.h socket.h sockios.h spinlock.h statfs.h stat.h \ termbits.h termios.h tlb.h types.h user.h" EQUAL="auxvec.h cputime.h device.h emergency-restart.h futex.h \ ioctl.h irq_regs.h kdebug.h local.h mutex.h percpu.h \ sections.h topology.h" NOMUUFILES="anchor.h bootstd.h coldfire.h commproc.h dbg.h \ elia.h flat.h m5206sim.h m520xsim.h m523xsim.h m5249sim.h \ m5272sim.h m527xsim.h m528xsim.h m5307sim.h m532xsim.h \ m5407sim.h m68360_enet.h m68360.h m68360_pram.h m68360_quicc.h \ m68360_regs.h MC68328.h MC68332.h MC68EZ328.h MC68VZ328.h \ mcfcache.h mcfdma.h mcfmbus.h mcfne.h mcfpci.h mcfpit.h \ mcfsim.h mcfsmc.h mcftimer.h mcfuart.h mcfwdebug.h \ nettel.h quicc_simple.h smp.h" FILES="atomic.h bitops.h bootinfo.h bug.h bugs.h byteorder.h cache.h \ cacheflush.h checksum.h current.h delay.h div64.h \ dma-mapping.h dma.h elf.h entry.h fb.h fpu.h hardirq.h hw_irq.h io.h \ irq.h kmap_types.h machdep.h mc146818rtc.h mmu.h mmu_context.h \ module.h page.h page_offset.h param.h pci.h pgalloc.h \ pgtable.h processor.h ptrace.h scatterlist.h segment.h \ setup.h sigcontext.h siginfo.h signal.h string.h system.h swab.h \ thread_info.h timex.h tlbflush.h traps.h uaccess.h ucontext.h \ unaligned.h unistd.h" mergefile() { BASE=${1%.h} git mv ${SOURCE}/$1 ${TARGET}/${BASE}_no.h git mv ${TARGET}/$1 ${TARGET}/${BASE}_mm.h cat << EOF > ${TARGET}/$1 EOF git add ${TARGET}/$1 } set -e mkdir -p ${TARGET} git mv include/asm-m68k/* ${TARGET} rmdir include/asm-m68k git rm ${SOURCE}/Kbuild for F in $INCLUDE $EQUAL; do git rm ${SOURCE}/$F done for F in $NOMUUFILES; do git mv ${SOURCE}/$F ${TARGET}/$F done for F in $FILES ; do mergefile $F done rmdir arch/m68knommu/include/asm rmdir arch/m68knommu/include Cc: Arnd Bergmann <arnd@arndb.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Greg Ungerer <gerg@uclinux.org> |