Author: sgryjp

Description:
SIMD instruction benchmark
Language: C
Repository: git://github.com/sgryjp/simd_test.git
Created: 2017-01-28T13:21:39Z
Project page: https://github.com/sgryjp/simd_test

License: MIT License


simd_test

SIMD instruction benchmark

Created to learn how to use SIMD instructions in C through compiler intrinsics
and x64 assembly language (using Visual Studio, for Windows programs).

This program runs some very basic linear algebra (vector) operations. It also
contains logic that determines whether the CPU supports certain SIMD instruction
sets, although the program does not use that logic. If you are interested, just
look at the source files. Note that there is no x86 (Win32) version of the .asm
files.
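
As a rough illustration of the kind of vector operation being timed, here is a
minimal sketch of an element-wise add, once in plain C and once with SSE
intrinsics. The names add_c_sketch and add_sse_sketch, the signatures, and the
use of unaligned loads are assumptions made for this example, not code taken
from the repository. On a non-x86 target the SSE path is compiled out and only
the scalar loop runs.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Plain C reference: dst[i] = a[i] + b[i]. */
void add_c_sketch(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* SSE version: four floats per iteration, scalar loop for the tail.
 * Unaligned loads/stores are used so the caller's buffers need no
 * special alignment (an assumption of this sketch). */
void add_sse_sketch(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Remaining 0..3 elements (or everything, without SSE). */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Both functions take the same arguments, so a benchmark can call them
interchangeably and compare the results element by element.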

Example Results

An example output from the program built with Visual Studio 2015
(on a laptop with Core i7 6600U, Skylake architecture):

  Number of times each function ran in 0.5 secs:
  add_c : 471470 times
  add_sse : 496611 times
  add_avx : 830946 times
  mul_c : 486972 times
  mul_sse : 497136 times
  mul_avx : 813090 times
  div_c : 498424 times
  div_sse : 492239 times
  div_avx : 595721 times
  mul_add_c : 384463 times
  mul_add_sse : 281115 times
  mul_add_avx : 506863 times
  dot_c : 96564 times
  dot_sse : 377050 times
  dot_sse41_dp : 254332 times
  dot_sse_asm : 387838 times
  dot_avx : 306822 times
  dot_avx_dp : 437111 times
  dot_avx_asm : 384199 times

An example output from the program built with Visual Studio 2008:

  Number of times each function ran in 0.5 secs:
  add_c : 222883 times
  add_sse : 642514 times
  mul_c : 233686 times
  mul_sse : 667243 times
  div_c : 127245 times
  div_sse : 508287 times
  mul_add_c : 157523 times
  mul_add_sse : 345454 times
  dot_c : 96752 times
  dot_sse : 380795 times
  dot_sse41_dp : 251552 times
  dot_sse_asm : 379998 times

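Each figure above is the number of calls completed within a fixed time window.
A measurement loop of that shape might look like the sketch below. The names
count_runs, bench_fn, and sample_workload are hypothetical, and the use of
clock() is an assumption; the actual program may time its runs differently.

```c
#include <time.h>

/* Hypothetical signature for a function under test. */
typedef void (*bench_fn)(void);

/* Volatile sink so the compiler cannot discard the sample workload. */
static volatile float sink;

/* Stand-in workload for this sketch: sums a few numbers. */
void sample_workload(void)
{
    float s = 0.0f;
    int i;
    for (i = 0; i < 1000; i++)
        s += (float)i;
    sink = s;
}

/* Calls fn repeatedly until `seconds` have elapsed according to clock()
 * and returns the number of completed calls.
 * Note: standard C defines clock() as processor time; the Microsoft C
 * runtime reports elapsed wall-clock time instead. Calling clock() on
 * every iteration also adds overhead of its own to each measured call. */
unsigned long count_runs(bench_fn fn, double seconds)
{
    unsigned long count = 0;
    clock_t limit = (clock_t)(seconds * CLOCKS_PER_SEC);
    clock_t start = clock();

    while (clock() - start < limit) {
        fn();
        count++;
    }
    return count;
}
```

Calling count_runs(sample_workload, 0.5) would produce one count in the style
of the listings above. Because the window is short, scheduling and CPU
frequency changes can move the count noticeably from run to run.
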
Notes

Some notes for anyone reading this README:

  • The results above are not a general comparison of C, SSE, and AVX; they
    only reflect my implementation.
  • The benchmark results are very unstable: the speed ranking changed almost
    every time I ran the program. Test it yourself.

Some things I learned by writing this:

  • Since every x64 CPU supports SSE, SSE detection logic is not needed in a
    64-bit program.
  • As one would expect, VS2015 emits much better optimized code than VS2008.
  • In terms of implementation cost, compiler intrinsics are better than
    hand-written assembly: significantly easier to write, safer because of
    type checking, no need to reimplement for each architecture, and so on.
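
While SSE can be assumed on x64, AVX cannot, so a run-time check is still
relevant there. The sketch below shows one way such a check could be written;
it is not the detection code from this repository. The function name
cpu_supports_avx is hypothetical. The MSVC branch uses __cpuid and _xgetbv
(which need a newer compiler than VS2008), and the GCC/Clang branch uses
__builtin_cpu_supports. On other targets the function simply returns 0.

```c
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>     /* __cpuid */
#include <immintrin.h>  /* _xgetbv */
#endif

/* Returns 1 if both the CPU and the OS support AVX, otherwise 0. */
int cpu_supports_avx(void)
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];  /* EAX, EBX, ECX, EDX */
    unsigned long long xcr0;

    __cpuid(regs, 1);
    if ((regs[2] & (1 << 27)) == 0)  /* ECX bit 27: OSXSAVE */
        return 0;
    if ((regs[2] & (1 << 28)) == 0)  /* ECX bit 28: AVX */
        return 0;

    /* XCR0 bits 1 and 2: the OS saves XMM and YMM register state. */
    xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    /* The compiler builtin performs the CPUID and OS checks. */
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;  /* Not an x86 target. */
#endif
}
```

A program could call this once at startup and skip the AVX variants when it
returns 0.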

One important thing to note: even though code written with compiler intrinsics
is generally as fast as fine-tuned hand-written assembly, it CAN be slower. In
fact, my first attempt with intrinsics was not efficient. I then checked how my
code had been translated in the Disassembly window in Visual Studio, learned
which part was inefficient, and rewrote my code to be more compiler friendly so
that part became efficient. The problem in my case was that the intrinsic
function I chose was not optimal for the task. Since a compiler intrinsic
directs the compiler to use one very specific instruction, choosing an
unsuitable intrinsic forces the compiler to emit that unsuitable instruction
and then work around it. To avoid such pitfalls, and to troubleshoot them when
they occur, I felt that it’s better to know basic assembly language so you can
analyze what’s going on behind the scenes.
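
The README does not say which intrinsic caused the problem, so the following is
only an illustration of how the choice of intrinsics shapes the generated
code. It is a dot product that keeps four partial sums in one register with
_mm_mul_ps and _mm_add_ps, and reduces them to a scalar once after the loop
instead of performing a horizontal operation on every iteration. The name
dot_sse_sketch and its signature are assumptions for this sketch. Without SSE
the function falls back to a scalar loop.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>  /* SSE intrinsics */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Dot product of a and b over n floats. */
float dot_sse_sketch(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;
#if HAVE_SSE
    __m128 acc = _mm_setzero_ps();  /* four running partial sums */
    __m128 tmp;

    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* Single horizontal reduction after the loop.
     * acc = [x0, x1, x2, x3] */
    tmp = _mm_movehl_ps(acc, acc);  /* [x2, x3, x2, x3] */
    acc = _mm_add_ps(acc, tmp);     /* lane0 = x0+x2, lane1 = x1+x3 */
    tmp = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(1, 1, 1, 1));
    acc = _mm_add_ss(acc, tmp);     /* lane0 = x0+x1+x2+x3 */
    sum = _mm_cvtss_f32(acc);
#endif
    /* Remaining 0..3 elements (or everything, without SSE). */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Comparing the disassembly of a loop like this with one that reduces to a scalar
on every iteration is a quick way to see the kind of detour described above.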

Once I understood the basics of x64 assembly, there seemed to be no reason to
write assembly code by hand any more.