Improving performance with SIMD intrinsics in three use cases
When done right, supplementing C or C++ code with vector intrinsics is exceptionally good for performance. For the cases presented in this blog post, vectorization improved performance by a factor of 3 to 12.
Introduction
Many developers write software that’s performance sensitive. After all, that’s one of the major reasons why we still pick C or C++ these days.
All modern processors are actually vector processors under the hood. Unlike scalar processors, which process data individually, modern vector processors process one-dimensional arrays of data. If you want to maximize performance, you need to write code tailored to these vectors.
Every time you write float s = a + b; you’re leaving a lot of performance on the table. The processor could have added four float numbers to another four, or even eight numbers to another eight if that processor supports AVX. Similarly, when you write int i = j + k; to add two integer numbers, you could have added four or eight numbers instead, with the corresponding SSE2 or AVX2 instructions.
Language designers, compiler developers, and other smart people have been trying for many years to compile scalar code into vector instructions in a way that would leverage the performance potential. So far, none of them have completely succeeded, and I’m not convinced it’s possible.
One approach to leveraging vector hardware is SIMD intrinsics, available in all modern C and C++ compilers. SIMD stands for “Single Instruction, Multiple Data”. SIMD instructions are available on many platforms; there’s a high chance your smartphone has them too, through the ARM NEON architecture extension. This article focuses on PCs and servers running on modern AMD64 processors.
Even with the focus on the AMD64 platform, the topic is way too broad for a single blog post. Modern SIMD instructions were introduced to Pentium processors with the release of the Pentium 3 in 1999 (that instruction set is SSE, nowadays sometimes called SSE 1), and many more have been added since then. For a more in-depth introduction, you can read my other article on the subject. Unlike this blog post, that one doesn’t have practical problems or benchmarks; instead, it tries to provide an overview of what’s available.
What are vector intrinsics?
To a programmer, intrinsics look just like regular library functions; you include the relevant header, and you can use the intrinsic. To add four float numbers to another four numbers, use the _mm_add_ps intrinsic in your code. In the compiler-provided header declaring that intrinsic, <xmmintrin.h>, you’ll find this declaration (assuming you’re using the VC++ compiler; in GCC you’ll see something different, which provides the same API to the user):
extern __m128 _mm_add_ps( __m128 _A, __m128 _B );
But unlike library functions, intrinsics are implemented directly in compilers. The above _mm_add_ps SSE intrinsic typically¹ compiles into a single instruction, addps. In the time it takes the CPU to call a library function, it might have completed a dozen of these instructions.
¹ (That instruction can fetch one of the arguments from memory, but not both. If you call it in a way that forces the compiler to load both arguments from memory, like this: __m128 sum = _mm_add_ps( *p1, *p2 ); the compiler will emit two instructions: the first to load an argument from memory into a register, the second to add the four values.)
The __m128 built-in data type is a vector of four floating-point numbers, 32 bits each, 128 bits in total. CPUs have wide registers for that data type, 128 bits per register. Since AVX was introduced in 2011, current PC processors have 256-bit registers; each of them can fit eight float values, four double-precision float values, or a larger number of integers, depending on their size.
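To illustrate, here’s a tiny self-contained example of what calling that intrinsic looks like. This is just a sketch for this post (the function name is mine), not code from the benchmark repository:

#include <xmmintrin.h> // SSE 1 intrinsics

// Add two arrays of four floats each, all four lanes at a time.
void addFourFloats( const float* a, const float* b, float* result )
{
    const __m128 va = _mm_loadu_ps( a );      // load 4 floats from a
    const __m128 vb = _mm_loadu_ps( b );      // load 4 floats from b
    const __m128 sum = _mm_add_ps( va, vb );  // add all 4 lanes with one instruction
    _mm_storeu_ps( result, sum );             // store 4 floats to result
}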
Source code that contains sufficient amounts of vector intrinsics, or embeds their assembly equivalents, is called manually vectorized code. Modern compilers and libraries already implement a lot of functionality this way, using intrinsics, assembly, or a combination of the two. For example, some implementations of the memset, memcpy, or memmove standard C library routines use SSE2 instructions for better throughput. Yet outside of niche areas like high-performance computing, game development, or compiler development, even very experienced C and C++ programmers are largely unfamiliar with SIMD intrinsics.
To help demonstrate, I’m going to present three practical problems and discuss how SIMD helped.
Image processing: grayscale
Suppose that we need to write a function that converts an RGB image to grayscale. Someone asked this very question recently.
Many practical applications need code like this. For example, when you compress raw image data to JPEG or video data to H.264 or H.265, the first step of the compression is quite similar. Specifically, compressors convert RGB pixels into the YUV color space. The exact color space is defined in the specs of these formats; for video, it’s often ITU-R BT.709 these days (see section 3, “Signal format”, of that spec).
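For reference, the scalar per-pixel conversion with the BT.709 luma coefficients looks roughly like this. A minimal sketch, not necessarily the exact scalar version used in the benchmark:

#include <cstdint>

// BT.709 luma: Y = 0.2126*R + 0.7152*G + 0.0722*B
inline uint8_t grayscaleBt709( uint8_t r, uint8_t g, uint8_t b )
{
    const float y = 0.2126f * r + 0.7152f * g + 0.0722f * b;
    return (uint8_t)( y + 0.5f ); // round to nearest
}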
Performance comparison
I’ve implemented a few versions, vectorized and not, and tested them with random images. My desktop has an AMD Ryzen 5 3600 plugged in; my laptop has an Intel i3-6157U soldered. The WSL column has results from the same desktop, but for a Linux binary built with GCC 7.4. The three rightmost columns of the table contain time in milliseconds (best of five runs) for an image of 3840×2160 pixels.

Observations
Vectorized versions are three to eight times faster than scalar code. On the laptop, the scalar version is likely too slow to handle 60 FPS video with frames of this size, while the vectorized code is fast enough for that.
The best way to vectorize that particular algorithm appears to be fixed-point 16-bit math. Vector registers fit twice as many 16-bit integers as 32-bit floats, allowing twice as many pixels to be processed in parallel in approximately the same time. On my desktop, the _mm_mul_ps SSE 1 intrinsic (which multiplies four floats from 128-bit registers) has 3 cycles latency and 0.5 cycles throughput. The _mm_mulhi_epu16 SSE 2 intrinsic (which multiplies eight fixed-point numbers from 128-bit registers) has the same 3 cycles latency and 1 cycle throughput.
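To give a concrete idea of the fixed-point approach, here’s a sketch of the per-register computation for eight pixels at a time. This is a simplified illustration with BT.709 coefficients in 16-bit fixed point; the function name and channel layout are mine, it skips the deinterleaving of RGB data into separate registers, and it’s not the exact code from the repository:

#include <emmintrin.h> // SSE 2

// r16, g16, b16 hold one channel for 8 pixels each, with every 8-bit value
// shifted left by 8, i.e. in the 0..0xFF00 range.
inline __m128i grayscale8( __m128i r16, __m128i g16, __m128i b16 )
{
    // BT.709 luma weights as 16-bit fixed point, round( c * 0x10000 );
    // the value above 32767 wraps to a negative short, but only the bit pattern matters here.
    const __m128i mulR = _mm_set1_epi16( (short)13933 ); // 0.2126
    const __m128i mulG = _mm_set1_epi16( (short)46871 ); // 0.7152
    const __m128i mulB = _mm_set1_epi16( (short)4732 );  // 0.0722

    // _mm_mulhi_epu16 keeps the high 16 bits of each 32-bit product,
    // i.e. ( channel * coefficient ) >> 16 for all 8 lanes at once.
    __m128i y = _mm_mulhi_epu16( r16, mulR );
    y = _mm_add_epi16( y, _mm_mulhi_epu16( g16, mulG ) );
    y = _mm_add_epi16( y, _mm_mulhi_epu16( b16, mulB ) );

    // The sum is gray << 8; shift back into the 0..255 range.
    return _mm_srli_epi16( y, 8 );
}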
In my experience, this outcome is common for image and video processing on CPU, not just for this particular grayscale problem.
On the desktop, upgrading from SSE to AVX (with twice as wide SIMD vectors) only improved performance a tiny bit; on the laptop it helped substantially. A likely reason for that is the RAM bandwidth bottleneck on the desktop. This is common, too: over the course of many years, CPU performance has been growing somewhat faster than memory bandwidth.
General math: dot product
Write a function to compute a dot product of two float vectors. Here’s a relevant Stack Overflow question. A popular application for dot products these days is machine learning.
Performance comparison
I didn’t want to bottleneck on memory again, so I’ve made a test that computes a dot product of 256k-long vectors, taking 1 MB of RAM each. That amount of data fits in the processor caches on both computers I’m using for benchmarks: the desktop has a 3 MB L2 cache and a 32 MB L3 cache; the laptop has a 3 MB L3 cache and a 64 MB L4 cache. The three rightmost columns are microseconds (µs), best of ten runs.

Observations
The best versions are 5 to 12 times faster than the scalar code.
The best SSE1-only version, SseVertical4, delivered performance close to the AVX+FMA versions. A likely reason for that is memory bandwidth. The source data is in the cache, so the bandwidth itself is very high; however, CPUs can only do a couple of loads per cycle. The code reads from two input arrays at once and is likely to hit that limit.
When built with VC++, the single-accumulator non-FMA SSE and especially AVX versions performed surprisingly well. I’ve looked at the disassembly: the compiler managed to hide some latency with instruction reordering. The code computes the product, increments the pointers, adds the product to the accumulator, and finally tests for the loop exit condition. This way, vector and scalar instructions are interleaved, hiding the latency of both. Only to an extent, though: the four-accumulator version is still faster.
The GCC-built scalar version is quite slow. This might be caused by my compiler options in CMakeLists.txt; I’m not sure they’re good enough, because for the last few years I’ve only built Linux software running on ARM devices.
Why multiple accumulators?
Data dependencies are the main thing I’d like to illustrate with this example.
From a computer science point of view, a dot product is a form of reduction. The algorithm needs to process large input vectors and compute just a single value. When the computations are fast (as in this case: multiplying floats from sequential blocks of memory is very fast), the throughput is often limited by the latency of the reduce operation.
Let’s compare the code of two specific versions, AvxVerticalFma and AvxVerticalFma2. The former has the following main loop:
for( ; p1 < p1End; p1 += 8, p2 += 8 )
{
    const __m256 a = _mm256_loadu_ps( p1 );
    const __m256 b = _mm256_loadu_ps( p2 );
    acc = _mm256_fmadd_ps( a, b, acc ); // Update the only accumulator
}
The AvxVerticalFma2 version runs the following code:
for( ; p1 < p1End; p1 += 16, p2 += 16 )
{
    __m256 a = _mm256_loadu_ps( p1 );
    __m256 b = _mm256_loadu_ps( p2 );
    dot0 = _mm256_fmadd_ps( a, b, dot0 ); // Update the first accumulator

    a = _mm256_loadu_ps( p1 + 8 );
    b = _mm256_loadu_ps( p2 + 8 );
    dot1 = _mm256_fmadd_ps( a, b, dot1 ); // Update the second accumulator
}
The _mm256_fmadd_ps intrinsic computes (a*b)+c for arrays of eight float values; that instruction is part of the FMA3 instruction set. The reason the AvxVerticalFma2 version is almost 2× faster is deeper pipelining hiding the latency.
When the processor submits an instruction, it needs the values of the arguments. If some of them are not yet available, the processor waits for them to arrive. The tables on https://www.agner.org/ say that on AMD Ryzen the latency of that FMA instruction is five cycles. This means that once the processor has started to execute that instruction, the result of the computation will only arrive five CPU cycles later. When the loop runs a single FMA instruction that needs the result computed by the previous loop iteration, that loop can only run one iteration every five CPU cycles.
With two accumulators that limit is the same, five cycles. However, the loop body now contains two FMA instructions that don’t depend on each other. These two instructions run in parallel, and the code delivers twice the throughput on the desktop.
That’s not the case on the laptop, though. The laptop was clearly bottlenecked on something else, but I’m not sure what it was.
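For completeness: once either loop finishes, the accumulators still hold eight partial sums per register, which need a final horizontal reduction. Here’s a sketch of how that can be done, assuming the two accumulators dot0 and dot1 from the loop above; this isn’t necessarily the exact code from the repository:

// assumes <immintrin.h> is included, as for the loops above
__m256 acc = _mm256_add_ps( dot0, dot1 );            // combine the two accumulators
__m128 low = _mm256_castps256_ps128( acc );          // lower 4 floats
__m128 high = _mm256_extractf128_ps( acc, 1 );       // upper 4 floats
__m128 sum = _mm_add_ps( low, high );                // 4 partial sums
sum = _mm_add_ps( sum, _mm_movehl_ps( sum, sum ) );  // 2 partial sums
sum = _mm_add_ss( sum, _mm_movehdup_ps( sum ) );     // final sum in the lowest lane
const float dotProduct = _mm_cvtss_f32( sum );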
Bonus chapter: precision issues
Initially this benchmark used much larger vectors at 256 MB each. I quickly discovered the performance in that case was limited by memory bandwidth, with not much differentiation showing up in the results.
There was another interesting issue, however.
Besides just measuring the time, my test program prints the computed dot product. This is to make sure compilers don’t optimize away the code and to check that the result is the same across my two computers and 15 implementations.
I was surprised to see the scalar version printed 1.31E+7 while all other versions printed 1.67E+7. Initially, I thought it was a bug somewhere. I implemented a scalar version that uses a double-precision accumulator, and sure enough, it printed 1.67E+7.
That whopping 20% error was caused by accumulation order. When code adds a small float value to a large float value, a lot of precision is lost. An extreme example: when the first float value is larger than 2^24 (about 16.8 million) and the second value is smaller than 1.0, the addition won’t change anything at all. It will just return the larger of the two arguments!
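A two-line demonstration of the effect, which you can verify yourself:

// With 32-bit floats, once a value reaches 2^24 the spacing between adjacent
// representable numbers is 2.0, so adding anything below 1.0 is rounded away.
const float big = 16777216.0f;   // 2^24
const float result = big + 0.99f;
// result == big : the 0.99f was lost entirely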
Technically, you can often achieve a more precise result with a pairwise summation approach. My vectorized code doesn’t quite do that. Still, the four-accumulator AVX version accumulates 32 independent scalar values (four registers with eight floats each), which is a step in the same direction. When there are 64 million numbers to sum up, 32 independent accumulators helped a lot with the precision.
Image processing: flood fill
For the final part of the article, I’ve picked a slightly more complicated problem.
For a layman, flood fill is what happens when you open an image in an editor, select the “paint bucket” tool, and click on the image. Mathematically, it’s a connected-component labeling operation on a regular 2D grid graph.
Unlike the first two problems, it’s not immediately clear how to vectorize this one. It’s not an embarrassingly parallel problem. In fact, flood fill is quite hard to implement efficiently on GPGPU. Still, with some effort, it’s possible to use SIMD in a way that significantly outperforms scalar code.
Because of the complexity, I only created two implementations. The first, the scalar version, is a scanline fill, as described on Wikipedia. It’s not too optimized, but not particularly slow either.
The second, the vectorized version, is a custom implementation. It requires AVX2. It splits the image into a 2D array of small dense blocks (in my implementation, the blocks are 16×16 pixels, one bit per pixel), then runs something resembling Wikipedia’s forest fire algorithm, only instead of individual pixels it processes complete blocks.
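To give a rough idea of the data structure (a sketch of the concept only; the repository’s implementation may differ in details), each 16×16 block is 256 bits, which happens to be exactly one AVX2 register:

#include <cstdint>
#include <immintrin.h> // AVX2

// One dense block: 16 rows × 16 pixels, 1 bit per pixel = 256 bits total.
struct alignas( 32 ) Block
{
    uint16_t rows[ 16 ]; // bit N of rows[ y ] is the pixel at ( x = N, y )
};

// The whole block loads into a single AVX2 register:
inline __m256i loadBlock( const Block& b )
{
    return _mm256_load_si256( (const __m256i*)&b );
}

// Propagating the fill by one pixel left or right is then a per-row 16-bit shift of the
// whole register, ANDed with the block’s “not a wall” mask; vertical propagation moves whole rows.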
In the results table below, the numbers shown are milliseconds. I ran each implementation on two images: maze-diagonal.png, 2212×2212 pixels, filled from the point x=885, y=128; and shapes.png, 1024×1024 pixels, filled from the same point. Due to the nature of the problem, the time it takes to fill an image depends a lot on the image and other input parameters. For the first test, I’ve deliberately picked an image that is relatively hard to flood fill.

As you can see from the table, vectorization improved performance by a factor of 1.9 to 3.5, depending on the CPU, compiler, and image. Both test images are in the repository, in the FloodFill/Images subfolder.
Conclusions
The performance win is quite large in practice.
The engineering overhead for vectorized code is not insignificant, especially for the flood fill, where the vectorized version has three to four times more code than the scalar scanline fill version. Admittedly, vectorized code is harder to read and debug; the difference fades with experience, but never disappears.
Source Code
I have posted the source code for these tests on GitHub. It requires C++17, and I’ve tested it on Windows 10 with Visual Studio 2017 and Ubuntu Linux 18 with GCC 7.4.0. The freeware Community edition of Visual Studio is fine. I have only tested 64-bit builds. The code is published under the copy/paste-friendly terms of the MIT license.
Because this article is targeted towards people unfamiliar with SIMD, I wrote more comments than I normally do, and I hope they help.
Here are the commands I used to build the test projects on Linux:
mkdir build
cd build
cmake ../
make
This blog post is about intrinsics, not C++17. The C++ parts are less than ideal; I’ve implemented the bare minimum required for the benchmarks. The flood fill project includes the stb_image and stb_image_write third-party libraries to handle PNG images: http://nothings.org/stb. Again, this is not something I would necessarily do in production-quality C++ code. OS-provided image codecs are generally better: libpng on Linux, or WIC on Windows.
I hope this gives you a sense of what’s possible when you tap into the power of SIMD intrinsics.
Tags: parallel processing, simd
24 Comments
Have you tried Intel ISPC? A lot easier and more flexible than writing SIMD intrinsics by hand.
It’s not more flexible, it’s less flexible. One of the examples above used PMULHUW, how do you tell ISPC to do that? What about the alternative implementation with PMADDUBSW and PMADDWD, how do you tell ISPC to do that?
The dot product and similar problems could have been done with ISPC.
Agree with Harold.
Also, dot product and some other common algorithms already have decent implementations in third-party manually vectorized libraries, like Eigen or Intel MKL. These are easier to integrate than ISPC: no need for another compiler or language, they’re just libraries; Eigen is even header-only.
I’ve picked dot product because of that SO question, and also because it was simple enough to implement and benchmark a dozen versions. When I need a dot product in production software I’m working on, I usually start with Eigen, and more often than not it’s good enough already. Just don’t forget to enable SIMD support with the macros: EIGEN_VECTORIZE_AVX2, EIGEN_VECTORIZE_AVX, EIGEN_VECTORIZE_FMA, etc.
Haven’t tried that but I’ve read about it. Can’t say I like it too much. One more compiler in the build system, especially bad on Windows where you usually use VC++ not clang. No ARM NEON support.
If you like C#, there are SIMD types available: https://docs.microsoft.com/en-us/dotnet/standard/simd
I think they’ll only become good enough after a couple of major versions of .NET.
The classes you linked only contain 32-bit floating-point classes. If that’s what you need you’re lucky, but as you see from the examples, 2 out of 3 examples use integer SIMD instructions. For images, video and audio, integer SIMD is often the way to go. For HPC stuff, more often than not you need 64-bit doubles as opposed to 32-bit floats.
The new .NET Core 3.0 has support for explicit SIMD; however, (a) there’s no ARM NEON, and (b) things like this sometimes make it much slower than equivalent C++: https://github.com/dotnet/runtime/issues/13811
Hey, great post!!
I’m surprised you wrote this 2 days ago!!
idea 1: Would be great if you could share resources to learn SIMD from.
idea 2: Robust SSE implementation. A post on how to write code that uses a reasonably well-established subset of SSE/AVX instructions but still falls back to plain ol’ vanilla C or C++ in case these are not supported by the host machine.
Anyway. Thank you very much. This was helpful.
Huh? I have yet to find a compiler that doesn’t optimize a dot product to use SIMD, latency hiding included. I don’t think disregarding all the effort put into auto-vectorization, probably by not enabling the relevant compiler switches, is fair. There is some limited use for SIMD intrinsics in critical sections, but usually, and by that I mean in 99.9% of cases, writing clear code will enable the compiler to optimize it just fine.
> I have yet to find a compiler that doesn’t optimize a dot product to use SIMD, latency hiding included.
They don’t do that. Just look at the performance tables in this article. As you can see, the manually vectorized version is multiple times faster, even on GCC.
For code like the dot product, GCC indeed unrolls the inner loop and does use some SIMD; unfortunately, it fails to use multiple accumulators, so the code bottlenecks on latency. Not sure it uses FMA either.
Alright, I was mistaken here. GCC seems to indeed not hide latency for the naive version.
GCC definitely uses FMA for the dot product though; I just checked with Compiler Explorer. Make sure to compile with -mfma.
Now, I’ve got an idea: If you want to use multiple accumulators to hide latency … just do that! Using 2 (high-level) accumulator variables doubles throughput for me, and using 4 adds a few more percent. Have you also investigated this? Maybe my quick test is also flawed, but if this is correct, it once again shows that there is little need for low-level fiddling these days, if you know what you actually want, and know how to convey it to the compiler.
By the way, if you’re not aware: there’s a compiler switch for us Ryzen / Zen folk, -march=znver1. While in my experience it only changes performance marginally for most workloads, it does sometimes create more readable asm, especially for SIMD, for some reason.
> it once again shows that there is little need for low-level fiddling these days
2 out of 3 examples in the article, dot product and brightness, are very simple; otherwise I wouldn’t have been able to implement that many versions of each. Very likely, you can indeed use some compiler flags and/or rewrite the code so the automatic vectorizer will do the right thing. However, by doing so you’d vendor-lock yourself to a specific compiler, maybe even to a specific version of it.
Even ignoring multiple compilers, it does not mean there is little need for intrinsics. Just look at the vectorized version of my flood fill algorithm. No amount of auto-vectorization will turn the scanline version into the block-based SIMD-optimized version; these are completely different algorithms operating on different internal data structures.
In the context of vector math, a simple example of an only slightly more complicated problem where auto-vectorization fails completely is the dot product of a sparse vector and a dense vector. A function like that often takes the vast majority of CPU time in code dealing with sparse matrices. I recently did that for a conjugate gradient solver, in some closed-source software I’m working on.
No amount of automatic vectorization is going to help with that. To make it fast, it’s necessary to adjust the RAM layout of the sparse matrix. Specifically, replace std::vector<double> with std::vector<__m256d>, and rebuild the matrix data and both indices of the matrix. https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)
Compilers never touch memory layout of data structures; they treat it as given. Adjusting data structures for SIMD friendliness is often an important first step towards better performance.
Also, clang outperforms gcc by a factor of 7 to 8 for a naive dot product, with the obligatory -mavx2 -mfma -ffast-math set for both, and using doubles everywhere, because it _does_ accumulate into 4 ymms. One more reason to use it. And one less reason to optimize by hand.
Compiler explorer link: https://godbolt.org/z/K1366K
Great post! It would be interesting to measure the energy used by these different implementations too. That would probably be more difficult, but energy efficiency matters especially for mobile applications. Since memory access is quite power-hungry, I have a hunch that the int16 code should use less power than the float code as it uses 2x less RAM bandwidth…
Deep inside that article there are links to other stuff I’ve written/made.
This is more in-depth intro to AMD64 SIMD: http://const.me/articles/simd/simd.pdf
This is offline documentation that combines Intel’s guide with better performance tables: https://github.com/Const-me/IntelIntrinsics/
About idea 2, I don’t think that’s needed anymore. On PC, support for SSE2 is pretty much guaranteed. The only modern CPU without it is Intel Quark, a very niche product that can’t even run conventional Windows and was discontinued in early 2019.
All 64-bit PC processors are required to support at least SSE1 and SSE2. In practice, Steam hardware survey says 100% of Steam users have SSE3, 98.61% of them have SSSE3, and 97.94% of them have SSE 4.1.
Any half-respectable programmer will understand that it’s bad practice to use intrinsics in application-level code. This is a matter that belongs to library and compiler authors. To say the least, this post is of no use, because 99% of the users who should use this probably have more knowledge about this topic than you can gather here. It’s an ancient part of the C language with more than enough papers available online.
> Those intrinsics are cpu dependent, hard to reproduce (especially the latest versions) and should not be consumed by the enduser.
And any fully respectable programmer will understand that SIMD intrinsics are very useful, and that performance matters.
They’re just functions after all.
In some applications, the vast majority of the code is what most programmers would consider “library code”.
> Those intrinsics are cpu dependent
If you target a game console, you only care about one specific CPU model. If you target PCs (Macs and Linux included), the CPU is guaranteed to support SSE1 and SSE2, and has a very high chance of supporting SSE up to and including 4.1. In a couple of years, AVX will join the list.
> hard to reproduce (especially the latest versions)
Not sure what you mean by that. There are only a few instructions that deliver different results on different CPUs, like rsqrtps or rcpps. The rest of them are 100% reproducible, unless someone messes with the rounding mode bits in the MXCSR status register.
> and should not be consumed by the enduser
End users are different. If you don’t need them, don’t use them. There are many people working on video games, non-gaming 3D graphics like CAD/CAM/CAE, video processing, image processing, complex embedded software, and similar things that need to compute over large amounts of data as fast as possible. If you’re lucky enough to find a library with a good enough license that does what you want, you should definitely consider it; however, that’s not always the case.
Thanks for the great article!
I also tried to use SIMD several times, but the biggest issue in my opinion is platform dependency.
C/C++ is already quite challenging when it comes to platform dependencies, but if you use SIMD you also need to take care of the compiler used, the SIMD instruction set used, whether SIMD is available at all, etc. (if I remember correctly…)
In my opinion, SIMD would/could be much more popular if it were standardized.
> you even need to take care of the used compiler
The intrinsics are the same across all compilers. I routinely use documentation from Intel’s compiler to write code I’m building with vc++, gcc or clang.
> if SIMD is available at all
You are guaranteed to have at least SSE1 and SSE2. Modern operating systems (Windows 10, OSX, and most modern Linux distributions) can’t install or boot without SSE2. Also, SSE1 and 2 are required parts of AMD64 instruction set, all 64-bit PC processors in the world are required to support them.
Unless you’re developing something very specific, I think it’s generally OK for desktop software to require SSE up to and including 4.1. The Steam hardware survey says only 2.06% of users don’t have SSE 4.1. The last CPU without SSE 4.1 is probably AMD Bobcat from 2011; that’s 9 years old now, and computers don’t usually live that long.
The Stack Overflow blog is excellent, very technical but without using boring language, each post more interesting than the last; it’s great to keep up with the news in computer programming and programming languages!
I have published a paper that is a review of SIMD, titled “SIMD programming using Intel vector extensions”, in the Journal of Parallel and Distributed Computing 135 (2020) 83–100. I hope you’ll take the time to read it.
I figure a lot of power is lost calculating array locations, and that if the lower dimensions were padded to power-of-two sizes, the multiply for computing vector offsets could be a shift, one clock with a barrel shifter.
I think it might help to replace the generic “desktop” and “laptop” entries in the tables with the actual distro you used. For example, if you used Ubuntu, it typically contains old software, which would explain why WSL wins sometimes (and btw, what distro did you use on WSL?).