Back with a post after 6 years of silence. If you had to parse a microsecond-resolution epoch timestamp as quickly as possible, how would you do it? We’ll take a look at using compiler intrinsics to do it in log(n) time.
The problem
Let’s say, theoretically, you have some text-based protocol or file that contains microsecond timestamps. You need to parse these timestamps as quickly as possible. Maybe it’s JSON, maybe it’s a CSV file, maybe something else bespoke. The timestamp is 16 characters long - and the same trick would, for instance, also apply to 16-digit credit card numbers.
In the end you have to implement a function similar to this:
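(A sketch of what such a signature could look like - the function name and the choice of std::string_view are mine, not fixed by the problem; the original code lives in the linked repo.)

```cpp
#include <cstdint>
#include <string_view>

// Parse a 16-character string of ASCII digits (e.g. "1585201087123567")
// into the microsecond timestamp it represents.
std::uint64_t parse_timestamp(std::string_view timestamp) noexcept;
```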
The native solution
Let’s start with what’s available, and compare. We have std::atoll, a function inherited from C, std::stringstream, the newer C++17 <charconv> header, and, by request, boost::spirit::qi.
I’ll be using Google Benchmark to
measure the performance, and to have a baseline let’s compare against loading
the final result into a register - i.e. no actual parsing involved.
Let’s run the benchmarks! The code is not important here, it just shows what is being benchmarked.
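For illustration, a couple of the cases might look roughly like this (a sketch using Google Benchmark; example_timestamp is a stand-in name for the input string, which - as discussed in the conclusion - needs to live in a separate translation unit so the compiler can’t fold the parse away):

```cpp
#include <benchmark/benchmark.h>
#include <charconv>
#include <cstdint>
#include <cstdlib>

// Defined in another translation unit to keep the compiler from
// constant-folding the whole parse at compile time.
extern const char* example_timestamp;  // 16 ASCII digits

static void BM_atoll(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::atoll(example_timestamp));
  }
}
BENCHMARK(BM_atoll);

static void BM_from_chars(benchmark::State& state) {
  for (auto _ : state) {
    std::uint64_t result = 0;
    std::from_chars(example_timestamp, example_timestamp + 16, result);
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK(BM_from_chars);

BENCHMARK_MAIN();
```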
Wow, stringstream is pretty bad. Not that it’s a fair comparison, but parsing a single integer using stringstream is 391 times slower than just loading our integer into a register. <charconv> and boost::spirit do a lot better by comparison.
Since we know our string contains the number we’re trying to parse, and we don’t need to do any whitespace skipping, can we be faster? Just how much time is spent in validation?
The naive solution
Let’s write a good old for loop. Read the string character by character, and build up the result.
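Roughly like this (a sketch; parse_naive is just my name for it):

```cpp
#include <cstdint>
#include <string_view>

inline std::uint64_t parse_naive(std::string_view s) noexcept
{
  std::uint64_t result = 0;
  for (char digit : s)
  {
    result *= 10;           // shift the digits parsed so far up one decimal place
    result += digit - '0';  // append the new digit
  }
  return result;
}
```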
That’s actually not bad for a simple for loop. If such a simple solution is able to beat a standard-library implementation, it means there’s quite a lot of effort that goes into input validation. As a side note - if you know your input, or can get away with simpler validation, you can get some significant speedups.
For further solutions and benchmarks, let’s ignore the standard library functions. We should be able to go much faster than this.
The brute force solution
If we know it’s 16 bytes, why even have a for loop? Let’s unroll it!
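Unrolled for exactly 16 digits it could look like this (a sketch; the constants are simply the powers of ten for each position):

```cpp
#include <cstdint>
#include <string_view>

inline std::uint64_t parse_unrolled(std::string_view s) noexcept
{
  std::uint64_t result = 0;

  result += (s[0]  - '0') * 1000000000000000ULL;
  result += (s[1]  - '0') * 100000000000000ULL;
  result += (s[2]  - '0') * 10000000000000ULL;
  result += (s[3]  - '0') * 1000000000000ULL;
  result += (s[4]  - '0') * 100000000000ULL;
  result += (s[5]  - '0') * 10000000000ULL;
  result += (s[6]  - '0') * 1000000000ULL;
  result += (s[7]  - '0') * 100000000ULL;
  result += (s[8]  - '0') * 10000000ULL;
  result += (s[9]  - '0') * 1000000ULL;
  result += (s[10] - '0') * 100000ULL;
  result += (s[11] - '0') * 10000ULL;
  result += (s[12] - '0') * 1000ULL;
  result += (s[13] - '0') * 100ULL;
  result += (s[14] - '0') * 10ULL;
  result += (s[15] - '0');

  return result;
}
```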
Ok, that’s slightly better again, but we’re still processing a character at a time.
The byteswap insight
Let’s draw out the operations in the unrolled solution as a tree, on a simplified example of parsing ‘1234’ into a 32-bit integer:
We can see that the number of multiplications and additions grows linearly with the number of characters. It’s hard to see how to improve this, because every multiplication is by a different factor (so we can’t multiply “in one go”), and at the end of the day we need to add up all the intermediate results.
However, it’s still very regular. For one thing, the first character in the string is multiplied by the largest factor, because it is the most significant digit.
On a little-endian machine (like x86), an integer’s first byte is its least significant byte, while the first character of a string holds the most significant digit.
Now, to reinterpret the bytes of a string as an integer we have to use std::memcpy (to avoid strict-aliasing violations), and we have the compiler intrinsic __builtin_bswap64 to swap the bytes in one instruction. The std::memcpy will get optimized out, so this is a win so far.
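So the opening of our 8-character parser might look like this (a sketch of the parse_8_chars function we’ll finish below):

```cpp
#include <cstdint>
#include <cstring>

inline std::uint64_t parse_8_chars(const char* string) noexcept
{
  std::uint64_t chunk = 0;
  std::memcpy(&chunk, string, sizeof(chunk));  // compiles down to a single load
  chunk = __builtin_bswap64(chunk);            // first character now sits in the most significant byte

  // ... combining the digits comes next
  return chunk;
}
```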
But now that we have an integer that kind of, sort of looks like what we want, how do we get it across the finish line without too much work?
The divide and conquer insight
From the previous step, we end up with an integer whose bit representation has each digit placed in a separate byte. I.e. even though one byte can represent up to 256 values, we have values 0-9 in each byte of the integer. They are also in the right little-endian order. Now we just need to “smash” them together somehow.
We know that doing it linearly would be too slow, so what’s the next possibility? O(log(n))! We need to combine adjacent digits into pairs in one step, then pairs of digits into groups of four, and so on, until we have the entire integer.
After I posted the first version of this article, Sopel97 on reddit pointed out that the byteswap is not necessary. Combining adjacent digits works either way - their order doesn’t matter. I realized that it helped me with the next insight, but could be omitted for the final code.
The key is working on adjacent digits simultaneously. This allows a tree of operations, running in O(log(n)) time.
This involves multiplying the even-index digits by a power of 10 and leaving the odd-index digits alone, which can be done with bitmasks that selectively apply the operations to one set of digits or the other.
Let’s finish the parse_8_chars
function we started earlier by employing this
masking trick. As a neat side-effect of the masking, we don’t need to subtract
'0'
, since it will be masked away.
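Here is what that could look like - a sketch assuming the byteswapped layout from above (as noted earlier, the swap could instead be dropped by mirroring the masks):

```cpp
#include <cstdint>
#include <cstring>

inline std::uint64_t parse_8_chars(const char* string) noexcept
{
  std::uint64_t chunk = 0;
  std::memcpy(&chunk, string, sizeof(chunk));
  chunk = __builtin_bswap64(chunk);

  // 1-byte trick: combine 4 pairs of single digits into 4 two-digit numbers.
  // The & 0x0f also strips the '0' (0x30) from each character.
  std::uint64_t lower_digits = (chunk & 0x000f000f000f000f);
  std::uint64_t upper_digits = (chunk & 0x0f000f000f000f00) >> 8;
  chunk = lower_digits + upper_digits * 10;

  // 2-byte trick: combine 2 pairs of two-digit numbers into 2 four-digit numbers.
  lower_digits = (chunk & 0x000000ff000000ff);
  upper_digits = (chunk & 0x00ff000000ff0000) >> 16;
  chunk = lower_digits + upper_digits * 100;

  // 4-byte trick: combine the pair of four-digit numbers into the final 8-digit result.
  lower_digits = (chunk & 0x000000000000ffff);
  upper_digits = (chunk & 0x0000ffff00000000) >> 32;
  chunk = lower_digits + upper_digits * 10000;

  return chunk;
}
```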
The trick
Putting it all together: to parse our 16-digit integer, we break it up into two chunks of 8 characters, run the parse_8_chars we have just written on each chunk, combine the results, and benchmark it!
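A sketch of that combination (100000000 is 10^8, which shifts the first eight digits past the last eight):

```cpp
#include <cstdint>

inline std::uint64_t parse_16_chars(const char* string) noexcept
{
  std::uint64_t upper_digits = parse_8_chars(string);      // first 8 characters
  std::uint64_t lower_digits = parse_8_chars(string + 8);  // last 8 characters
  return upper_digits * 100000000 + lower_digits;
}
```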
Not too shabby, we shaved almost 56% off the unrolled-loop benchmark! Still, it feels like we are manually doing a bunch of masking and element-wise operations. Maybe we can just let the CPU do all the hard work?
The SIMD trick
We have the main insight:
- Combine groups of digits simultaneously to achieve O(log(n)) time
We also have a 16-character, or 128-bit, string to parse - can we use SIMD? Of course we can! SIMD stands for Single Instruction Multiple Data, and is exactly what we are looking for. SSE and AVX instructions are supported on both Intel and AMD CPUs, and they operate on wider registers - 128 bits in the case of SSE, which fits our 16 characters exactly.
I used the Intel Intrinsics Guide to find the right compiler intrinsics for the right SIMD CPU instructions.
Let’s set up the digits in each of the 16 bytes first:
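(A sketch - load_digits is a hypothetical helper name, and _mm_loadu_si128 is the plain unaligned SSE2 load.)

```cpp
#include <immintrin.h>

// Load all 16 characters into one 128-bit register and subtract '0' from
// every byte, leaving the digit values 0-9 in each of the 16 bytes.
inline __m128i load_digits(const char* string) noexcept
{
  const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(string));
  const __m128i zeros = _mm_set1_epi8('0');
  return _mm_sub_epi8(chunk, zeros);
}
```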
And now, the stars of the show are the madd functions. These SIMD functions do exactly what we did with our bitmask tricks - they take a wide register, interpret it as a vector of smaller integers, multiply each by a given multiplier, and add neighboring ones together into a vector of wider integers. All in one instruction!
As an example of taking every byte, multiplying every other one by 10, and adding adjacent pairs together, we can use _mm_maddubs_epi16.
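Roughly like this, as a sketch (combine_digit_pairs is a hypothetical helper name; the multiplier is laid out so that the byte holding the tens digit of each pair is the one multiplied by 10):

```cpp
#include <immintrin.h>

// SSSE3: multiply each unsigned byte by the matching signed byte of the
// multiplier, then add adjacent products into eight 16-bit lanes.
// Afterwards each 16-bit lane holds a two-digit number 0-99.
inline __m128i combine_digit_pairs(__m128i digits) noexcept
{
  const __m128i mult = _mm_set_epi8(
      1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10);
  return _mm_maddubs_epi16(digits, mult);
}
```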
There is another instruction for the 2-byte trick, but unfortunately I could not find one for the 4-byte trick - that needed two instructions. Here is the completed parse_16_chars:
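(Reconstructed below as a sketch of the approach described above - the exact code lives in the linked repo - assuming SSE4.1 for _mm_packus_epi32.)

```cpp
#include <cstdint>
#include <immintrin.h>

inline std::uint64_t parse_16_chars(const char* string) noexcept
{
  __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(string));
  chunk = _mm_sub_epi8(chunk, _mm_set1_epi8('0'));  // each byte now holds a digit 0-9

  // Pairs of single digits -> eight 2-digit numbers in 16-bit lanes (SSSE3).
  {
    const __m128i mult = _mm_set_epi8(
        1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10);
    chunk = _mm_maddubs_epi16(chunk, mult);
  }
  // Pairs of 2-digit numbers -> four 4-digit numbers in 32-bit lanes (SSE2).
  {
    const __m128i mult = _mm_set_epi16(1, 100, 1, 100, 1, 100, 1, 100);
    chunk = _mm_madd_epi16(chunk, mult);
  }
  // No single madd goes from 32 to 64 bits, so pack the four 4-digit numbers
  // back into 16-bit lanes (SSE4.1) and madd once more to get two 8-digit numbers.
  {
    chunk = _mm_packus_epi32(chunk, chunk);
    const __m128i mult = _mm_set_epi16(0, 0, 0, 0, 1, 10000, 1, 10000);
    chunk = _mm_madd_epi16(chunk, mult);
  }

  // The low 64 bits now hold the first 8 digits (low half) and the last
  // 8 digits (high half); combine them with one multiply and add.
  const auto both = static_cast<std::uint64_t>(_mm_cvtsi128_si64(chunk));
  return (both & 0xffffffff) * 100000000 + (both >> 32);
}
```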
0.75 nanoseconds! Beautiful.
Some commenters have complained that this is a toy problem, that there is no input validation or length checking, and that this would only work with a fixed-length integer. For one thing, this can be used as a “fast path” when you know your integers are long, falling back on a simple loop in other cases. Secondly, I bet that with a few more clever SIMD instructions you could have something running under 2 nanoseconds complete with validation and length checking.
Conclusion
Compilers are absolutely amazing pieces of technology. They regularly surprise me (or even blow my mind) with how well they can optimize code and see through
what I’m doing. While preparing the benchmarks for this post, I ran into a
problem - if the string I was trying to parse in the benchmarks was visible to
the compiler in the same compilation unit, no matter what I did, gcc and clang
would evaluate my solutions at compile time and put the final result in the
binary. Even the SIMD implementation! All my benchmarks would come out to be
equivalent to a single mov
instruction. I had to put the integer string in a
separate compilation unit, but I bet if I turned on LTO (Link Time
Optimization) the compiler would optimize that away too.
Having said all that, there is a culture of “optimization is the root of all evil” - that handwritten assembly or hand-optimization has no place anymore, and that we should just blindly rely on our compilers. I think the two positions are complementary - trust your compiler, trust your library vendor, but nothing beats carefully thought-out code when you know your inputs and you’ve done your measurements to know it will make a difference.
Imagine your business revolved around parsing a firehose of telemetry data, and
you chose to use std::stringstream
. Would you buy more servers or spend a
little time optimizing your parsing?
Post Scriptum
- All benchmarks ran on a 3.8 GHz AMD Ryzen 3900X, on Linux, compiled with GCC 9.2
- Running on a Mac from 2014 with an Intel processor and compiling with Clang, the non-SIMD trick actually ran slower than the naive loop. The SIMD trick was still fastest.
- I left out table-based parsing methods because I believe they would be slower than the SIMD method here, as they require more data cache usage, and frankly I was too lazy to fill out those tables. If people believe the contrary, I can spend some time adding that to these benchmarks.
- You can find the code and the benchmarks here.
- If something is unclear please let me know in the comments - I will try to clarify.
- Reddit discussions can be found on r/programming and r/cpp.
- Turns out the byteswap is totally unnecessary! It helped me gain the next insight when thinking through the problem, but the swapping can happen naturally as part of the horizontal additions. I will update the benchmark numbers and update this post in the next few days.
- Also, it turns out a smart fellow, Wojciech Muła, already thought of this! I’m quite happy that we converged on the same methods - down to the same SSE instructions.