Memory Copy Optimization (3): Going Deeper

December 20th, 2015

Today I continued optimizing the earlier memory copy code:

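1. Changed the small-copy path: the threshold grew from 64 bytes to 128 bytes, and the copies now go through xmm registers instead of int, making small copies about 80% faster.
2. Changed the medium-copy path: instead of copying through 4 xmm registers in parallel, it now copies through 8 in parallel and adds prefetch, a gain of roughly 20%.
3. Removed the branch that checked alignment at the head of the destination address; a single xmm copy now brings the destination into alignment, a gain of about 10%.
4. Added a test case: to be closer to real workloads, there is now a random-access test that copies blocks of random position and length within a 10MB region (definitely larger than L2).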
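As an illustration of items 1 to 3, here is a minimal sketch written with SSE2 intrinsics. It is not taken from the FastMemcpy source (see the repository for the real code): the names `COPY16`, `sketch_copy_small` and `sketch_copy_medium`, the overlapping-tail trick in the small path, the 256-byte prefetch distance and the `_MM_HINT_NTA` hint are all assumptions made for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Move 16 bytes through an xmm register; no alignment is required. */
#define COPY16(d, s) \
    _mm_storeu_si128((__m128i *)(d), _mm_loadu_si128((const __m128i *)(s)))

/* Hypothetical small-copy helper for 16 <= n <= 128, dst and src must not
 * overlap. Whole 16-byte blocks go front to back; the last block is
 * anchored at the end and may rewrite bytes that were already copied. */
static void sketch_copy_small(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    size_t i;
    for (i = 0; i + 16 <= n; i += 16)
        COPY16(dst + i, src + i);
    COPY16(dst + n - 16, src + n - 16);
}

/* Hypothetical medium-copy helper, dst and src must not overlap. */
static void *sketch_copy_medium(void *destination, const void *source,
                                size_t n)
{
    unsigned char *dst = (unsigned char *)destination;
    const unsigned char *src = (const unsigned char *)source;
    size_t skip;

    if (n < 16)
        return memcpy(destination, source, n);

    /* Head: one unaligned 16-byte copy, then advance to the next 16-byte
     * boundary of dst. skip is always 1..16, so no "is dst already
     * aligned?" branch is needed. */
    COPY16(dst, src);
    skip = 16 - ((uintptr_t)dst & 15);
    dst += skip;
    src += skip;
    n -= skip;

    /* Body: 128 bytes per iteration through 8 xmm registers. All loads are
     * issued before the stores; dst is aligned, src may not be. */
    while (n >= 128) {
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;
        __m128i c0, c1, c2, c3, c4, c5, c6, c7;

        /* A prefetch never faults; the address is formed as an integer to
         * avoid out-of-bounds pointer arithmetic near the end. */
        _mm_prefetch((const char *)((uintptr_t)src + 256), _MM_HINT_NTA);

        c0 = _mm_loadu_si128(s + 0);
        c1 = _mm_loadu_si128(s + 1);
        c2 = _mm_loadu_si128(s + 2);
        c3 = _mm_loadu_si128(s + 3);
        c4 = _mm_loadu_si128(s + 4);
        c5 = _mm_loadu_si128(s + 5);
        c6 = _mm_loadu_si128(s + 6);
        c7 = _mm_loadu_si128(s + 7);
        _mm_store_si128(d + 0, c0);
        _mm_store_si128(d + 1, c1);
        _mm_store_si128(d + 2, c2);
        _mm_store_si128(d + 3, c3);
        _mm_store_si128(d + 4, c4);
        _mm_store_si128(d + 5, c5);
        _mm_store_si128(d + 6, c6);
        _mm_store_si128(d + 7, c7);

        src += 128;
        dst += 128;
        n -= 128;
    }

    /* Tail: fewer than 128 bytes remain; left to memcpy in this sketch. */
    memcpy(dst, src, n);
    return destination;
}
```

A call such as `sketch_copy_medium(out, in, 4096)` behaves like `memcpy(out, in, 4096)` for non-overlapping buffers. The head step trades a few redundantly written bytes for the removal of a data-dependent branch, which is the idea behind item 3.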

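To keep random number generation from skewing the results, the random numbers are generated ahead of time. The final average performance is more than twice that of the standard library shipped with gcc4.9: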

https://github.com/skywind3000/FastMemcpy
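The idea of generating the random positions and lengths before the timed loop can be sketched roughly as follows. This is only an illustration, not the benchmark code from the repository: `copy_job`, `rand_below`, `make_jobs` and `run_jobs` are hypothetical names, and the use of separate source and destination buffers, `rand()` as the generator and `clock()` as the timer are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any replacement under test. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* One pre-generated copy: where to read, where to write, how much. */
typedef struct {
    size_t src_off;
    size_t dst_off;
    size_t len;
} copy_job;

/* rand() may return as few as 15 random bits, so combine two calls. */
static size_t rand_below(size_t limit)
{
    size_t r = ((size_t)rand() << 15) ^ (size_t)rand();
    return r % limit;
}

/* Fill jobs[] before any timing starts. Every job stays inside a buffer of
 * buf_size bytes; max_len must be between 1 and buf_size. */
static void make_jobs(copy_job *jobs, size_t count, size_t buf_size,
                      size_t max_len, unsigned seed)
{
    size_t i;
    srand(seed);
    for (i = 0; i < count; i++) {
        size_t len = rand_below(max_len) + 1;            /* 1..max_len */
        jobs[i].len = len;
        jobs[i].src_off = rand_below(buf_size - len + 1);
        jobs[i].dst_off = rand_below(buf_size - len + 1);
    }
}

/* Timed section: only table lookups and copies, no random number calls.
 * Returns elapsed CPU time in milliseconds. */
static double run_jobs(copy_fn fn, unsigned char *dst,
                       const unsigned char *src,
                       const copy_job *jobs, size_t count)
{
    size_t i;
    clock_t start = clock();
    for (i = 0; i < count; i++)
        fn(dst + jobs[i].dst_off, src + jobs[i].src_off, jobs[i].len);
    return (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Running `run_jobs` once with `memcpy` and once with the function under test over the same `jobs` array gives both implementations exactly the same sequence of positions and lengths, so the difference in time comes from the copy routine alone.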

Benchmark results for the latest code (compare with the old table below to see whether the new version improved):

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266 ms
result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343 ms
result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218 ms
result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546 ms
result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172 ms
result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437 ms
result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191 ms
result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281 ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328 ms
result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406 ms
result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421 ms
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468 ms
result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452 ms
result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468 ms
result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514 ms
result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472 ms

benchmark random access:
memcpy_fast=515ms memcpy=1014ms

 

The old results were:

result: gcc4.9 (msvc 2012 got a similar result):
 
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249 ms
result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271 ms
result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269 ms
result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260 ms
 
benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300 ms
result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339 ms
result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361 ms
 
benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241 ms
result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509 ms
result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520 ms
 
benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179 ms
result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430 ms
result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430 ms
result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455 ms
 
benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80 ms
result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205 ms
result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200 ms
 
benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222 ms
result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210 ms
result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230 ms
 
benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201 ms
result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270 ms
result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361 ms
result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321 ms
 
benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391 ms
result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453 ms
result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439 ms
 
benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422 ms
result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516 ms
result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436 ms

benchmark random access:
memcpy_fast=594ms memcpy=1161ms

 

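Earlier posts in this series:

Memory Copy Optimization (1): Small Copy Optimization

Memory Copy Optimization (2): Full-Size Copy Optimization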

  1. December 21st, 2015 at 18:46 | #1

    VS2015 has improved a lot over earlier versions, and in a few cases it has even started to pull ahead.
    ====================================
    E:\FastMemcpy-master>cl -nologo -O2 FastMemcpy.c
    FastMemcpy.c
    c:\program files (x86)\microsoft sdks\windows\v7.1a\include\sal_supp.h(57): warning C4005: '__useHeader': macro redefinition
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\sal.h(2886): note: see previous definition of '__useHeader'
    c:\program files (x86)\microsoft sdks\windows\v7.1a\include\specstrings_supp.h(77): warning C4005: '__on_failure': macro redefinition
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\sal.h(2896): note: see previous definition of '__on_failure'

    E:\FastMemcpy-master>FastMemcpy.exe
    benchmark(size=32 bytes, times=16777216):
    result(dst aligned, src aligned): memcpy_fast=58ms memcpy=75 ms
    result(dst aligned, src unalign): memcpy_fast=58ms memcpy=77 ms
    result(dst unalign, src aligned): memcpy_fast=54ms memcpy=76 ms
    result(dst unalign, src unalign): memcpy_fast=54ms memcpy=76 ms

    benchmark(size=64 bytes, times=16777216):
    result(dst aligned, src aligned): memcpy_fast=66ms memcpy=81 ms
    result(dst aligned, src unalign): memcpy_fast=66ms memcpy=88 ms
    result(dst unalign, src aligned): memcpy_fast=66ms memcpy=87 ms
    result(dst unalign, src unalign): memcpy_fast=67ms memcpy=88 ms

    benchmark(size=512 bytes, times=8388608):
    result(dst aligned, src aligned): memcpy_fast=139ms memcpy=203 ms
    result(dst aligned, src unalign): memcpy_fast=145ms memcpy=221 ms
    result(dst unalign, src aligned): memcpy_fast=160ms memcpy=203 ms
    result(dst unalign, src unalign): memcpy_fast=162ms memcpy=207 ms

    benchmark(size=1024 bytes, times=4194304):
    result(dst aligned, src aligned): memcpy_fast=109ms memcpy=146 ms
    result(dst aligned, src unalign): memcpy_fast=113ms memcpy=157 ms
    result(dst unalign, src aligned): memcpy_fast=126ms memcpy=156 ms
    result(dst unalign, src unalign): memcpy_fast=126ms memcpy=157 ms

    benchmark(size=4096 bytes, times=524288):
    result(dst aligned, src aligned): memcpy_fast=51ms memcpy=50 ms
    result(dst aligned, src unalign): memcpy_fast=53ms memcpy=58 ms
    result(dst unalign, src aligned): memcpy_fast=54ms memcpy=68 ms
    result(dst unalign, src unalign): memcpy_fast=59ms memcpy=77 ms

    benchmark(size=8192 bytes, times=262144):
    result(dst aligned, src aligned): memcpy_fast=49ms memcpy=53 ms
    result(dst aligned, src unalign): memcpy_fast=53ms memcpy=63 ms
    result(dst unalign, src aligned): memcpy_fast=54ms memcpy=67 ms
    result(dst unalign, src unalign): memcpy_fast=54ms memcpy=73 ms

    benchmark(size=1048576 bytes, times=2048):
    result(dst aligned, src aligned): memcpy_fast=193ms memcpy=133 ms
    result(dst aligned, src unalign): memcpy_fast=193ms memcpy=141 ms
    result(dst unalign, src aligned): memcpy_fast=196ms memcpy=138 ms
    result(dst unalign, src unalign): memcpy_fast=192ms memcpy=146 ms

    benchmark(size=4194304 bytes, times=512):
    result(dst aligned, src aligned): memcpy_fast=238ms memcpy=262 ms
    result(dst aligned, src unalign): memcpy_fast=259ms memcpy=271 ms
    result(dst unalign, src aligned): memcpy_fast=288ms memcpy=319 ms
    result(dst unalign, src unalign): memcpy_fast=265ms memcpy=280 ms

    benchmark(size=8388608 bytes, times=256):
    result(dst aligned, src aligned): memcpy_fast=258ms memcpy=281 ms
    result(dst aligned, src unalign): memcpy_fast=262ms memcpy=291 ms
    result(dst unalign, src aligned): memcpy_fast=272ms memcpy=296 ms
    result(dst unalign, src unalign): memcpy_fast=268ms memcpy=302 ms

    benchmark random access:
    memcpy_fast=484ms memcpy=433ms

  2. December 21st, 2015 at 22:40 | #2

    @MuYu
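    Right, standard libraries keep improving too, and I hope they keep getting better so that we no longer need to hand-optimize things like this. My point is only that we shouldn't place "unconditional trust" in the off-the-shelf things we use. Years ago, reading the CPU instruction set, you would have assumed rep movsb was the fastest way to copy, but it was not; likewise, gcc's list.size() used to be O(n) rather than O(1). There are too many examples to list. So even the basic libraries are not perfect, and it is fair to question them, right?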

  3. December 21st, 2015 at 22:42 | #3

    @skywind
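    Look at how many open-source libraries reimplement parts of the basic libraries. SDL1/2, for example, still carries its own memcpy code (though it is not as fast as FastMemcpy). What that reflects is a deep distrust of early versions of the basic libraries, plus a programming preference for precise control over every step.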

  4. December 26th, 2015 at 10:15 | #4

    @MuYu
    I've pushed another update, haha, with a sizable speedup. Could you help test it on 2015? I don't have it on hand.

  5. nomi-otto
    January 8th, 2016 at 22:43 | #5

    @MuYu

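    If you read the source code carefully, you'll find that msvc 14's memcpy really is much more refined than msvc 12's.
    14 already splits the work into multiple cases much like skywind does, and the approach is basically the same; only some of the thresholds differ.
    Of course it is written in asm. As for the tiny part, the approach in this post would be quite painful to write in asm...
    Personally, though, judging from how glibc and msvc handle it, I don't think they paid special attention to the tiny case.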
