linux/drivers
Kirill Smelkov e3a8b4d22b [media] vivi: Optimize gen_text()
I've noticed that vivi takes a lot of CPU to produce its frames.
For example, with 8 devices and 8 simple programs running, each
capturing YUY2 640x480 and displaying it in X via SDL, the profile looks
as follows:
    # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
    # Samples: 82K of event 'cycles'
    # Event count (approx.): 31551930117
    #
    # Overhead          Command         Shared Object                                                           Symbol
    # ........  ...............  ....................
    #
        49.48%           vivi-*  [vivi]                [k] gen_twopix
        10.79%           vivi-*  [kernel.kallsyms]     [k] memcpy
        10.02%             rawv  libc-2.13.so          [.] __memcpy_ssse3
         8.35%           vivi-*  [vivi]                [k] gen_text.constprop.6
         5.06%             Xorg  [unknown]             [.] 0xa73015f8
         2.32%             rawv  [vivi]                [k] gen_twopix
         1.22%             rawv  [vivi]                [k] precalculate_line
         1.20%           vivi-*  [vivi]                [k] vivi_fillbuff
    (rawv is the display program; vivi-* combines vivi-000 through vivi-007)
So a lot of time is spent in gen_twopix() which, as the following
call-graph profile shows ...
    49.48%           vivi-*  [vivi]                [k] gen_twopix
                     |
                     --- gen_twopix
                        |
                        |--96.30%-- gen_text.constprop.6
                        |          vivi_fillbuff
                        |          vivi_thread
                        |          kthread
                        |          ret_from_kernel_thread
                        |
                         --3.70%-- vivi_fillbuff
                                   vivi_thread
                                   kthread
                                   ret_from_kernel_thread
... is called mostly from gen_text().
If we look at the inner loop of gen_text(), we see
    if (chr & (1 << (7 - i)))
            gen_twopix(dev, pos + j * dev->pixelsize, WHITE, (x+y) & 1);
    else
            gen_twopix(dev, pos + j * dev->pixelsize, TEXT_BLACK, (x+y) & 1);
which calls gen_twopix() for every character pixel. That is very
expensive, because gen_twopix() branches several times.
Now note that we operate on only two colors, WHITE and TEXT_BLACK, so
the pixels for those colors can be precomputed and gen_twopix() moved
out of the inner loop. Also note that for black and white, even/odd
makes no difference in any supported pixel format, so we can drop the
`odd` gen_twopix() parameter game.
So the first thing we are doing here is
    1) moving gen_twopix() calls out of gen_text() into vivi_fillbuff(),
       to pregenerate black and white colors, just before printing
       starts.
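A minimal user-space sketch of step 1, not the driver's actual code: make_pixel() is a hypothetical stand-in for gen_twopix(), and render_row_slow()/render_row_hoisted() are illustrative names. It only aims to show that both variants write the same pixels, while the hoisted one runs the color conversion twice in total instead of once per pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum color { TEXT_BLACK, WHITE };

/* Counts how often the (expensive) conversion runs. */
static int conversions;

/* Hypothetical stand-in for gen_twopix(): turns a color into a
 * 2-byte pixel. The real function branches on the pixel format. */
static uint16_t make_pixel(enum color c)
{
	conversions++;
	return c == WHITE ? 0xffffu : 0x0000u;
}

/* Before: one conversion per rendered pixel. */
static void render_row_slow(uint16_t *dst, uint8_t bits)
{
	assert(dst != NULL);
	for (int i = 0; i < 8; i++)
		dst[i] = make_pixel((bits & (1 << (7 - i))) ? WHITE : TEXT_BLACK);
}

/* After: the caller converts both colors once and passes them in. */
static void render_row_hoisted(uint16_t *dst, uint8_t bits,
			       uint16_t white, uint16_t black)
{
	assert(dst != NULL);
	for (int i = 0; i < 8; i++)
		dst[i] = (bits & (1 << (7 - i))) ? white : black;
}
```

In this sketch the caller would compute `white` and `black` once before printing starts, which mirrors what the commit describes for vivi_fillbuff().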
Next, gen_text's font rendering loop, even with gen_twopix() calls
moved out, was still inefficient and branchy, so let's
    2) rewrite the gen_text() loop so it uses fewer variables + unroll the
       char horizontal-rendering loop + instantiate 3 code paths for
       pixelsizes 2, 3 and 4, so that in all inner loops we don't have to
       branch or make indirections (*).
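A rough sketch of what step 2 could look like for the pixelsize=2 path, under stated assumptions: the function names, the flat `font` array and the 8x16 glyph layout (16 bytes per character, one byte per row) are illustrative, the latter inferred from the `shl $0x4` and `cmpl $0x10` in the disassembly rather than taken from the driver source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Render one 8-pixel font row for 2-byte pixels, fully unrolled.
 * fg and bg are the precomputed pixel values; there is no loop
 * counter and no per-pixel function call. */
static void render_row_u16(uint16_t *dst, uint8_t bits,
			   uint16_t fg, uint16_t bg)
{
	dst[0] = (bits & 0x80) ? fg : bg;
	dst[1] = (bits & 0x40) ? fg : bg;
	dst[2] = (bits & 0x20) ? fg : bg;
	dst[3] = (bits & 0x10) ? fg : bg;
	dst[4] = (bits & 0x08) ? fg : bg;
	dst[5] = (bits & 0x04) ? fg : bg;
	dst[6] = (bits & 0x02) ? fg : bg;
	dst[7] = (bits & 0x01) ? fg : bg;
}

/* Render scanline `line` (0..15) of a NUL-terminated string.
 * `font` is assumed to hold 16 row bytes per character code. */
static void render_text_line_u16(uint16_t *dst, const char *text,
				 const uint8_t *font, int line,
				 uint16_t fg, uint16_t bg)
{
	assert(dst != NULL && text != NULL && font != NULL);
	assert(line >= 0 && line < 16);
	for (; *text; text++, dst += 8)
		render_row_u16(dst, font[(uint8_t)*text * 16 + line], fg, bg);
}
```

The paths for pixelsizes 3 and 4 would be copies of the same code with a different pixel type, selected once outside the loops, so the inner loop never has to test the pixel size.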
With all of the above done, gen_text() compiles to nice, non-branchy,
streamlined code (showing the loop for pixelsize=2):
           │       cmp    $0x2,%eax
           │     ↓ jne    26
           │       mov    -0x18(%ebp),%eax
           │       mov    -0x20(%ebp),%edi
           │       imul   -0x20(%ebp),%eax
           │       movzwl 0x3ffc(%ebx),%esi
      0,08 │       movzwl 0x4000(%ebx),%ecx
      0,04 │       add    %edi,%edi
           │       mov    0x0,%ebx
      0,51 │       mov    %edi,-0x1c(%ebp)
           │       mov    %ebx,-0x14(%ebp)
           │       movl   $0x0,-0x10(%ebp)
           │       lea    0x20(%edx,%eax,2),%eax
           │       mov    %eax,-0x18(%ebp)
           │       xchg   %ax,%ax
      0,04 │ a0:   mov    0x8(%ebp),%ebx
           │       mov    -0x18(%ebp),%eax
      0,04 │       movzbl (%ebx),%edx
      0,16 │       test   %dl,%dl
      0,04 │     ↓ je     128
      0,08 │       lea    0x0(%esi),%esi
      1,61 │ b0:┌─→shl    $0x4,%edx
      1,02 │    │  mov    -0x14(%ebp),%edi
      2,04 │    │  add    -0x10(%ebp),%edx
      2,24 │    │  lea    0x1(%ebx),%ebx
      0,27 │    │  movzbl (%edi,%edx,1),%edx
      9,92 │    │  mov    %esi,%edi
      0,39 │    │  test   %dl,%dl
      2,04 │    │  cmovns %ecx,%edi
      4,63 │    │  test   $0x40,%dl
      0,55 │    │  mov    %di,(%eax)
      3,76 │    │  mov    %esi,%edi
      0,71 │    │  cmove  %ecx,%edi
      3,41 │    │  test   $0x20,%dl
      0,75 │    │  mov    %di,0x2(%eax)
      2,43 │    │  mov    %esi,%edi
      0,59 │    │  cmove  %ecx,%edi
      4,59 │    │  test   $0x10,%dl
      0,67 │    │  mov    %di,0x4(%eax)
      2,55 │    │  mov    %esi,%edi
      0,78 │    │  cmove  %ecx,%edi
      4,31 │    │  test   $0x8,%dl
      0,67 │    │  mov    %di,0x6(%eax)
      5,76 │    │  mov    %esi,%edi
      1,80 │    │  cmove  %ecx,%edi
      4,20 │    │  test   $0x4,%dl
      0,86 │    │  mov    %di,0x8(%eax)
      2,98 │    │  mov    %esi,%edi
      1,37 │    │  cmove  %ecx,%edi
      4,67 │    │  test   $0x2,%dl
      0,20 │    │  mov    %di,0xa(%eax)
      2,78 │    │  mov    %esi,%edi
      0,75 │    │  cmove  %ecx,%edi
      3,92 │    │  and    $0x1,%edx
      0,75 │    │  mov    %esi,%edx
      2,59 │    │  mov    %di,0xc(%eax)
      0,59 │    │  cmove  %ecx,%edx
      3,10 │    │  mov    %dx,0xe(%eax)
      2,39 │    │  add    $0x10,%eax
      0,51 │    │  movzbl (%ebx),%edx
      2,86 │    │  test   %dl,%dl
      2,31 │    └──jne    b0
      0,04 │128:   addl   $0x1,-0x10(%ebp)
      4,00 │       mov    -0x1c(%ebp),%eax
      0,04 │       add    %eax,-0x18(%ebp)
      0,08 │       cmpl   $0x10,-0x10(%ebp)
           │     ↑ jne    a0
which almost disappears from the profile:
    # cmdline : /home/kirr/local/perf/bin/perf record -g -a sleep 20
    # Samples: 49K of event 'cycles'
    # Event count (approx.): 16799780016
    #
    # Overhead          Command         Shared Object                                                           Symbol
    # ........  ...............  ....................
    #
        27.51%             rawv  libc-2.13.so          [.] __memcpy_ssse3
        23.77%           vivi-*  [kernel.kallsyms]     [k] memcpy
         9.96%             Xorg  [unknown]             [.] 0xa76f5e12
         4.94%           vivi-*  [vivi]                [k] gen_text.constprop.6
         4.44%             rawv  [vivi]                [k] gen_twopix
         3.17%           vivi-*  [vivi]                [k] vivi_fillbuff
         2.45%             rawv  [vivi]                [k] precalculate_line
         1.20%          swapper  [kernel.kallsyms]     [k] read_hpet
i.e. gen_twopix() overhead dropped from 49% to 4%, gen_text() loops
from ~8% to ~5%, and the overall cycle count dropped from 31551930117
to 16799780016, which is a ~1.9x speedup for the whole workload.
(*) For RGB24 rendering I've introduced x24, which can be thought of as
    a synthetic u24 that simplifies the code. This is done because when
    memcpy is used for conditional assignment, gcc generates suboptimal
    code with more indirections.
    Fortunately, struct assignment is built into C, and that's all we
    need from the pixel type for font rendering.
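To illustrate the footnote, here is a hedged sketch of how a 3-byte pixel type could be rendered with plain struct assignment. The type layout, the field name `b` and the function name render_row_x24() are assumptions made for this example; the commit only states that x24 exists and is assigned as a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 3-byte pixel; on common ABIs sizeof(x24) is 3, so an
 * array of x24 is laid out like packed RGB24 data. */
typedef struct { uint8_t b[3]; } x24;

/* Render one 8-pixel font row of 3-byte pixels. The conditional
 * expression selects a whole struct and the assignment copies all
 * three bytes, so no memcpy() call is needed per pixel. */
static void render_row_x24(x24 *dst, uint8_t bits, x24 fg, x24 bg)
{
	assert(dst != NULL);
	for (int i = 0; i < 8; i++)
		dst[i] = (bits & (0x80 >> i)) ? fg : bg;
}
```

With a pixel type like this, the pixelsize=3 path can share its structure with the 2-byte and 4-byte paths, differing only in the element type.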

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2012-12-21 18:25:21 -02:00
..
accessibility
acpi ACPI video: Ignore errors after _DOD evaluation. 2012-11-03 09:52:54 +08:00
amba
ata SCSI fixes on 20121122 2012-11-22 09:14:54 -10:00
atm atm: forever loop loading ambassador firmware 2012-11-28 11:38:11 -05:00
auxdisplay
base Merge remote-tracking branch 'linus/master' into staging/for_v3.8 2012-11-28 07:22:38 -02:00
bcma bcma: fix unregistration of cores 2012-10-15 14:45:51 -04:00
block mtip32xx: Fix padding issue 2012-11-23 14:32:55 +01:00
bluetooth Bluetooth: ath3k: Add support for VAIO VPCEH [0489:e027] 2012-11-09 16:45:37 +01:00
bus drivers: bus: ocp2scp: add pdata support 2012-11-07 09:35:53 -08:00
cdrom
char Merge branch 'block-dev' 2012-12-03 10:53:25 -08:00
clk clk: ux500: Register slimbus clock lookups for u8500 2012-11-12 10:20:23 -08:00
clocksource
connector
cpufreq cpufreq / powernow-k8: Change maintainer's email address 2012-10-31 21:02:57 +01:00
cpuidle ACPI idle, CPU hotplug: Fix NULL pointer dereference during hotplug 2012-10-08 22:52:54 -04:00
crypto IXP4xx crypto: MOD_AES{128,192,256} already include key size. 2012-11-22 03:36:15 +00:00
dca
devfreq
dio
dma Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma 2012-10-26 14:59:01 -07:00
edac Merge git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac 2012-12-03 11:16:37 -08:00
eisa
extcon extcon : register for cable interest by cable name 2012-10-23 16:32:18 +09:00
firewire [SCSI] sd: Implement support for WRITE SAME 2012-11-13 22:45:42 -08:00
firmware firmware/memmap: avoid type conflicts with the generic memmap_init() 2012-10-19 14:07:47 -07:00
gpio gpio-mcp23s08: Build I2C support even when CONFIG_I2C=m 2012-11-17 22:22:24 +01:00
gpu Merge branch 'drm-fixes-3.7' of git://people.freedesktop.org/~agd5f/linux 2012-11-28 16:51:10 +10:00
hid Merge remote-tracking branch 'linus/master' into staging/for_v3.8 2012-11-28 07:22:38 -02:00
hsi
hv Drivers: hv: Cleanup error handling in vmbus_open() 2012-10-24 15:46:27 -07:00
hwmon hwmon: Fix chip feature table headers 2012-11-05 21:54:40 +01:00
hwspinlock
i2c Merge branch 'i2c-embedded/for-current' of git://git.pengutronix.de/git/wsa/linux 2012-11-23 11:59:26 -10:00
ide
idle
iio iio: Remove duplicates for light/ in Kconfig and Makefile 2012-10-19 19:44:06 +01:00
infiniband Merge branches 'cxgb4' and 'mlx4' into for-next 2012-10-23 09:03:49 -07:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2012-11-22 21:45:34 -10:00
iommu intel-iommu: Fix lookup in add device 2012-11-17 13:27:15 +01:00
irqchip irqchip: irq-bcm2835: Add terminating entry for of_device_id table 2012-11-06 07:37:10 -08:00
isdn isdn: Make CONFIG_ISDN depend on CONFIG_NETDEVICES 2012-11-07 18:59:26 -05:00
leds ledtrig-cpu: kill useless mutex to fix sleep in atomic context 2012-11-11 12:09:43 -08:00
lguest Merge branch 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux 2012-10-07 21:04:56 +09:00
macintosh
md Single bugfix for raid1/raid10. 2012-12-02 16:24:31 -08:00
media [media] vivi: Optimize gen_text() 2012-12-21 18:25:21 -02:00
memory
memstick
message
mfd mfd: twl4030: Fix chained irq handling on resume from suspend 2012-11-21 17:46:41 +01:00
misc pwm: Changes for v3.7-rc1 2012-10-10 20:15:24 +09:00
mmc mmc: sdhci-s3c: fix the card detection in runtime-pm 2012-11-07 15:40:52 -05:00
mtd revert "Revert "mm: remove __GFP_NO_KSWAPD"" 2012-11-30 08:51:17 -08:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-12-02 16:39:00 -08:00
nfc NFC: Fix pn533 target mode memory leak 2012-11-20 00:09:26 +01:00
nubus
of of/platform: sparse fix 2012-10-17 15:53:03 -05:00
oprofile mm: use mm->exe_file instead of first VM_EXECUTABLE vma->vm_file 2012-10-09 16:22:18 +09:00
parisc
parport Xtensa patchset for 3.7 2012-10-09 16:11:46 +09:00
pci PCI/portdrv: Don't create hotplug slots unless port supports hotplug 2012-11-05 16:59:59 -07:00
pcmcia Merge branch 'testing/driver-warnings' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into fixes 2012-10-19 15:40:18 -07:00
pinctrl pinctrl/samsung: don't allow enabling pinctrl-samsung standalone 2012-11-15 11:58:24 +01:00
platform Merge branches 'fixes-for-37', 'ec' and 'thermal' into release 2012-10-09 01:47:35 -04:00
pnp
power Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux 2012-10-13 11:27:59 +09:00
pps
ps3
ptp
pwm pwm: Changes for v3.7-rc1 2012-10-10 20:15:24 +09:00
rapidio rapidio: fix kernel-doc warnings 2012-11-16 14:33:04 -08:00
regulator Merge remote-tracking branches 'regulator/fix/gpio', 'regulator/fix/put' and 'regulator/fix/supp-volt' into tmp 2012-11-15 11:16:02 +09:00
remoteproc remoteproc: fix error path of ->find_vqs 2012-11-29 10:05:09 +02:00
rpmsg Merge branch 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux 2012-10-07 21:04:56 +09:00
rtc drivers/rtc/rtc-tps65910.c: fix invalid pointer access on _remove() 2012-11-30 08:51:18 -08:00
s390 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-11-16 14:10:15 -08:00
sbus
scsi [SCSI] sd: Implement support for WRITE SAME 2012-11-13 22:45:42 -08:00
sfi
sh sh: Fix up more fallout from pointless ARM __iomem churn. 2012-10-15 14:08:48 +09:00
sn
spi spi: Some minor MXS fixes 2012-10-28 11:13:54 -07:00
ssb
staging [media] staging/media: Use dev_ printks in cxd2099/cxd2099.[ch] 2012-12-21 18:25:16 -02:00
target target: Fix handling of aborted commands 2012-11-17 13:35:44 -08:00
tc
thermal exynos4_tmu_driver_ids should be exynos_tmu_driver_ids. 2012-11-03 09:52:55 +08:00
tty tty vt: Fix a regression in command line edition 2012-11-21 16:45:32 -08:00
uio mm: kill vma flag VM_RESERVED and mm->reserved_vm counter 2012-10-09 16:22:19 +09:00
usb SCSI fixes on 20121122 2012-11-22 09:14:54 -10:00
uwb
vfio vfio: Fix PCI INTx disable consistency 2012-10-10 09:10:32 -06:00
vhost vhost: fix length for cross region descriptor 2012-11-28 11:27:01 -05:00
video omapdss fixes for 3.7-rc 2012-11-23 12:01:02 -10:00
virt
virtio virtio: Don't access index after unregister. 2012-11-09 14:54:24 +10:30
vlynq
vme
w1
watchdog Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu 2012-10-07 21:06:10 +09:00
xen Bug-fix: 2012-11-20 18:52:01 -10:00
zorro
Kconfig
Makefile IPMI: Change link order 2012-10-16 18:07:12 -07:00