Consider that we have an array of 8-bit unsigned integers and we want to calculate their sum modulo 256.
In C++ we can do it like this:
|
|
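The original listing isn't reproduced in this extract; based on the raw pointer/length interface and the std::accumulate loop referenced later in the post, it was presumably something along these lines:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>

// Unsigned 8-bit addition wraps around on overflow,
// so accumulating into a uint8_t naturally yields the sum modulo 256.
std::uint8_t sum(const std::uint8_t* data, std::size_t len) {
    return std::accumulate(data, data + len, std::uint8_t{0});
}
```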
When compiling with Clang 19.1.0 and flags -O3 -march=rocketlake -fno-unroll-loops, we get the following assembly:
|
|
The code is nicely vectorized, but that’s quite a lot of assembly (68 lines).
Looking closer at it, we can see that the ‘main’ loop is this one, which works on YMMWORD-sized chunks (256 bits’ worth of data or, in this case, 32 × 8-bit integers):
|
|
There’s also this part that adds up an XMMWORD-sized chunk if len % 32 >= 16¹:
|
|
This part deals with the elements that remain after processing all YMMWORD and XMMWORD chunks (i.e. it adds up the remaining len % 16 integers):
|
|
And last but not least, there’s also this branch that returns zero immediately whenever we pass len == 0 to the function:
|
|
If we care about vectorization and about minimizing code size (for example because we might want to do aggressive inlining while keeping important things relatively close together in the instruction cache), then we can significantly reduce the assembly by imposing or exploiting a couple of constraints on our input. For example, let’s assume that:
- The length of the array that we want to sum up is always a multiple of 32 (the number of 8-bit values that fit in a YMM register). This constraint can be a natural property of our input, or we can artificially add zero values as padding until we reach a length that is a multiple of 32.
- The array is non-empty. By convention, we can say that the sum function must NOT be called on empty arrays.
We can express these constraints with __builtin_unreachable()², a compiler built-in which tells the compiler that a specific code location/code path is unreachable:
|
|
Let’s see where this gets us, with the same compiler/flags as before:
|
|
Pretty good. That’s the entire output. We’re down to 17 lines, with very little effort. In terms of
machine code size, we’re down from 222 bytes (in the initial unoptimized version) to 65 bytes (the
version above). This is because the compiler now knows it’s redundant to generate those branches that deal with empty arrays, process an XMMWORD chunk, or handle single leftover elements. Since we indicated that the entire array can be traversed in YMMWORD-sized chunks, the compiler disregards all other branches from the first version.
This may be a small and simple example, but generally speaking, you might be surprised how much performance or code-size improvement you can squeeze out simply by exploiting your input’s constraints. And these things can eventually add up.
If you’ve identified a bottleneck, try reasoning about whether or not your input has any exploitable
properties. For example, if you’re doing a lot of work parsing integers with std::stoi()³, but
your integers are actually between 0 and 100, try writing a custom parsing function that uses this
information and see if it improves the speed. Or, if you’re decoding some JSON data that necessarily
adheres to a certain schema, try NOT decoding that JSON data as an arbitrary JSON, because you
might end up paying a lot for various validations or for storing metadata⁴ for abstract JSON types,
even though you already know your actual types before decoding.
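As a rough illustration of the integer-parsing idea, a hypothetical helper (not from the post) that only handles values known to be in [0, 100] can skip sign handling, overflow checks, and error reporting entirely:

```cpp
#include <cstddef>

// Hypothetical fast path: parse a decimal integer that is guaranteed
// to be in [0, 100] -- at most three digits, no sign, no validation.
int parse_small_int(const char* s, std::size_t n) {
    int value = 0;
    for (std::size_t i = 0; i < n; ++i)
        value = value * 10 + (s[i] - '0');
    return value;
}
```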
What about GCC?
Oftentimes this kind of optimization won’t translate well (or at all) when switching from one compiler to another.
In this case GCC does optimize the sum_with_constraints() version as expected (i.e. it eliminates all branches except for the YMMWORD loop).
However, when we try using a std::vector instead of a raw pointer and a length, things get messy:
|
|
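The vector variant isn’t shown in this extract; judging by the name sum_vec_with_constraints() used later, it presumably looked something like this sketch:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

std::uint8_t sum_vec_with_constraints(const std::vector<std::uint8_t>& vec) {
    // Same constraints as before, expressed on the vector's size.
    if (vec.empty() || vec.size() % 32 != 0)
        __builtin_unreachable();
    return std::accumulate(vec.begin(), vec.end(), std::uint8_t{0});
}
```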
The good news is that Clang is still capable of optimizing this, just like sum_with_constraints(). The bad news is that GCC isn’t. In this case, __builtin_unreachable() is ineffective (the same assembly gets generated when we omit those __builtin_unreachable() calls altogether).
We can try making the sum_vec_with_constraints() function use sum_with_constraints(), like this:
|
|
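This delegating version isn’t included in the extract either; presumably it was a thin wrapper along these lines:

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

std::uint8_t sum_with_constraints(const std::uint8_t* data, std::size_t len) {
    if (len == 0 || len % 32 != 0)
        __builtin_unreachable();
    return std::accumulate(data, data + len, std::uint8_t{0});
}

// Delegate to the raw-pointer version, hoping the constraints carry over.
std::uint8_t sum_vec_with_constraints_v2(const std::vector<std::uint8_t>& vec) {
    return sum_with_constraints(vec.data(), vec.size());
}
```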
But even this does not work. I tried a few other ways of helping the compiler figure this out, but none of them were successful. If you give it a try and find something that works, let me know (see update⁵)!
I opened an issue on GCC’s tracker with this problem, and it seems to be related to how expressions get optimized out in certain contexts. We call __builtin_unreachable() on a constraint for a certain expression, but, from the compiler’s perspective after a few optimization passes, that expression ends up not being used in the actual code. Andrew Pinski described the problem more concretely:
This is because len is vec.end() - vec.begin(), where end() is a pointer load. And then data + len gets re-optimized to just vec.end(). And so len is no longer taken into account in the std::accumulate loop.
This goes to show how:
- A seemingly trivial optimization pass can completely throw off another optimization pass, especially when you’re doing this kind of very fine-grained manual optimizations without dropping down to assembly.
- A ‘zero-cost’ abstraction (which is what some people may call std::vector, for example) can end up costing a lot more than zero compared to a more C-like counterpart.
I appreciate the GCC team’s quick responses to these reports. Hopefully they find a solution for this issue.
Source code:
- sum(): raw, Godbolt
- sum_with_constraints(): raw, Godbolt
- sum_vec_with_constraints(): raw, Godbolt
- sum_vec_with_constraints_v2(): raw, Godbolt
- It only adds up one XMMWORD-sized chunk, because two or more would constitute at least a YMMWORD chunk (and those have already been processed at this point). ↩︎
- If you prefer not using the compiler built-in, C23 has unreachable(), and C++23 has std::unreachable(). ↩︎
- Or the function in your language of choice that parses arbitrary 32-bit integers. This advice is not language-specific. ↩︎
- E.g. union tags, array pointers/lengths. If, according to some pre-defined schema, you have a million JSON arrays filled with a single integer element each, you can end up paying more for the internal array representations in your language of choice than for the actual array elements. Instead, you could write a custom JSON decoder that allocates a single array with a million elements’ worth of capacity and appends each integer to it while decoding the JSON. This avoids wasting a lot of space on the internal representations of one million arrays, assuming that’s how your non-schema-based JSON decoder would do it. ↩︎
- @namniav commented on Twitter with a few solutions. You can use asm volatile, std::launder()⁶, or a volatile qualifier to basically make the compiler disconnect len from vec.end() - vec.begin(). After that, the __builtin_unreachable() requirements will still hold in the std::accumulate() loop. The asm volatile approach produces the shortest assembly; the other ones have a bit of overhead. See the solutions here. ↩︎
- Who even chooses these names? ↩︎
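For reference, the asm volatile workaround mentioned in footnote 5 can be sketched roughly as follows (this is my guess at the approach, not the exact code from the linked solutions; the function name is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

std::uint8_t sum_vec_constrained_asm(const std::vector<std::uint8_t>& vec) {
    std::size_t len = vec.size();
    // Empty inline asm that claims to read and write len: the optimizer can
    // no longer substitute vec.end() - vec.begin() back for it, so the
    // constraints below survive down to the accumulate loop.
    asm volatile("" : "+r"(len));
    if (len == 0 || len % 32 != 0)
        __builtin_unreachable();
    return std::accumulate(vec.data(), vec.data() + len, std::uint8_t{0});
}
```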