The problem
Let’s consider the problem with the following description:
- We have an array of arbitrary length filled with uint8_t values;
- We want to count the number of even values in the array;
- Before doing the calculation, we know that the number of even values in the array is between 0 and 255¹.
A typical solution
To start off, we can leverage the STL for this calculation, more precisely std::count_if():
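A minimal sketch of this first version (the std::span-based signature follows the footnotes; the x % 2 == 0 predicate is an assumption) could look like this:

```cpp
#include <algorithm>
#include <cstdint>
#include <span>

// Sketch: delegate the counting to std::count_if(). Note that we have no say
// in the accumulator type here; std::count_if() picks it for us.
auto count_even_values_v1(std::span<const uint8_t> values)
{
    return std::count_if(values.begin(), values.end(),
                         [](uint8_t value) { return value % 2 == 0; });
}
```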
Let’s take a look at the assembly (generated by Clang 19.1.0 with flags -O3 -march=rocketlake -fno-unroll-loops). This is the main loop, which counts the even numbers:
[Clang assembly listing of the main loop omitted.]
We can see that the assembly generated by Clang is vectorized, and iterates the array DWORD-by-DWORD (i.e. four 8-bit values at a time).
Let’s inspect the algorithm step by step (a rough C++ intrinsics sketch of the loop body follows the list):
- First off, at line 5 in the snippet (vpbroadcastb xmm1, byte ptr [rip + .LCPI0_1]), we fill xmm1 with 16 x 8-bit masks, all equal to 0b1. So xmm1 will look like this:
  xmm1[127:0] := [0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1]
  Because we’re iterating DWORD-by-DWORD, only the first 32 bits of the xmm2 register will be filled with relevant data on each iteration, so we can disregard the rest, which will be set to zero (for both xmm2 and xmm1). So we’ll actually consider xmm1 as only having 4 packed 8-bit masks:
  xmm1[31:0] := [0b1, 0b1, 0b1, 0b1]
- At line 7 (vmovd xmm2, dword ptr [rsi + rax]), we load a DWORD into the first 32 bits of the xmm2 register, i.e. 4 packed 8-bit integers. So xmm2 will look like this (where W, X, Y, Z represent the bit patterns of the first, second, third, and fourth loaded integers, respectively):
  xmm2[31:0] := [0bZZZZZZZZ, 0bYYYYYYYY, 0bXXXXXXXX, 0bWWWWWWWW]
- At line 8 (vpandn xmm2, xmm2, xmm1), we do an AND NOT operation. vpandn DEST, SRC1, SRC2 is defined as DEST := NOT(SRC1) AND SRC2, which means:
  xmm2[31:0] := [~0bZZZZZZZZ & 0b1, ~0bYYYYYYYY & 0b1, ~0bXXXXXXXX & 0b1, ~0bWWWWWWWW & 0b1]
  In simpler terms, this effectively selects the lowest bit of each packed value and then flips it. When a number is even, the lowest bit of its binary representation is zero, so the result of the AND NOT operation will be 0b1. Similarly, for odd numbers, the result will be 0b0.
- At line 9 (vpmovzxbq ymm2, xmm2), we zero-extend the four 8-bit results of the AND NOT operation to four equivalent 64-bit results (i.e. for each packed value, we add 56 bits of zero-padding). R1, R2, R3, R4 stand for the first, second, third, and fourth result of the earlier AND NOT operation, respectively:
  ymm2[255:0] := [zero-extended R4, zero-extended R3, zero-extended R2, zero-extended R1]
- At line 10 (vpaddq ymm0, ymm0, ymm2), we simply add the 4 results from the current iteration (ymm2) to the accumulator (ymm0):
  ymm0[63:0]    := ymm0[63:0]    + ymm2[63:0];
  ymm0[127:64]  := ymm0[127:64]  + ymm2[127:64];
  ymm0[191:128] := ymm0[191:128] + ymm2[191:128];
  ymm0[255:192] := ymm0[255:192] + ymm2[255:192];
- After the previous step, we either go back to step 1 if we still have DWORD-sized chunks to process, or we end the current loop and process smaller chunks (not relevant for this investigation, so we won’t go into further detail about those other branches).
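For readers who prefer C++ to assembly, here is a rough intrinsics sketch of what one iteration of this loop boils down to. This is a reconstruction based on the instructions described above, not the compiler’s actual output, and the function name is illustrative:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>
#include <span>

// Sketch of the vectorized main loop: process the input DWORD-by-DWORD,
// accumulating four 64-bit partial counts in a YMM register.
uint64_t count_even_values_intrinsics(std::span<const uint8_t> values)
{
    const __m128i ones = _mm_set1_epi8(1);   // vpbroadcastb: 16 x 8-bit 0b1 masks
    __m256i acc = _mm256_setzero_si256();    // ymm0: four 64-bit partial counts

    size_t i = 0;
    for (; i + 4 <= values.size(); i += 4) {
        uint32_t dword;
        std::memcpy(&dword, values.data() + i, sizeof(dword));
        const __m128i chunk   = _mm_cvtsi32_si128(static_cast<int>(dword)); // vmovd
        const __m128i is_even = _mm_andnot_si128(chunk, ones);              // vpandn: 0b1 if even, 0b0 if odd
        const __m256i widened = _mm256_cvtepu8_epi64(is_even);              // vpmovzxbq: 4 x 64-bit results
        acc = _mm256_add_epi64(acc, widened);                               // vpaddq: add to the accumulator
    }

    // Horizontal sum of the four partial counts, plus a scalar loop for the tail.
    uint64_t total = static_cast<uint64_t>(_mm256_extract_epi64(acc, 0)) +
                     static_cast<uint64_t>(_mm256_extract_epi64(acc, 1)) +
                     static_cast<uint64_t>(_mm256_extract_epi64(acc, 2)) +
                     static_cast<uint64_t>(_mm256_extract_epi64(acc, 3));
    for (; i < values.size(); ++i) {
        total += (values[i] % 2 == 0);
    }
    return total;
}
```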
Concrete example of running the algorithm
Suppose we have the array [1, 2, 1, 2, 1, 2, 1, 2] as input, or [0b01, 0b10, 0b01, 0b10, 0b01, 0b10, 0b01, 0b10] in binary.
The individual steps are:
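Reconstructing the register states from the steps above (highest lane written first, lane values in decimal):

Iteration 1 loads [1, 2, 1, 2]; the AND NOT results R1..R4 are [0, 1, 0, 1], so after the addition:
  ymm0[255:0] := [1, 0, 1, 0]
Iteration 2 loads the same values, [1, 2, 1, 2], so after the second addition:
  ymm0[255:0] := [2, 0, 2, 0]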
So at the end we have two partial results equal to 0, and two partial results equal to 2. Adding them up, we get the final answer 4.
Improving the typical solution
Now, with all of this in mind, you might be wondering: why are we zero-extending those four 8-bit values to four 64-bit values?
The answer is: because the type of the accumulator that std::count_if() uses is the difference_type of the iterator that we’re passing². In this case, the type of the accumulator will be long, a 64-bit integer type (on the platform that we’re targeting). This effectively means that
whenever we want to increment the accumulator, we’re telling the compiler that we require 64-bit
precision. Otherwise, if the compiler were to generate the increments on 32-bit integers (or
smaller), the result might overflow or wrap around, which would lead to a wrong answer at the end of
the calculation (so the compiler is not allowed to do that).
However, considering the constraint that was specified in the problem description (the total number
of even values in the array is between 0 and 255), we know that we do NOT need 64-bit precision
when doing the increments. In our case, 8 bits will suffice. Therefore, the type of the accumulator
can be uint8_t, the same as the array’s element type.
Let’s then write our own std::count_if() implementation which lets us control the type of the accumulator (regardless of the difference_type of the iterator):
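A minimal sketch of such an implementation could look like this (the name count_if_with_accumulator and the exact interface are illustrative, not necessarily the original ones):

```cpp
#include <cstdint>

// Sketch: like std::count_if(), but the accumulator type is chosen by the
// caller instead of being the iterator's difference_type.
template <typename Accumulator, typename InputIt, typename UnaryPred>
Accumulator count_if_with_accumulator(InputIt first, InputIt last, UnaryPred pred)
{
    Accumulator count{};
    for (; first != last; ++first) {
        if (pred(*first)) {
            ++count;  // incremented in Accumulator precision, not difference_type
        }
    }
    return count;
}
```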
We can then use this custom implementation as follows:
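For instance (again a sketch; the uint8_t return type matches the constraint from the problem description):

```cpp
#include <cstdint>
#include <span>

// Sketch: same as v1, but with an explicit uint8_t accumulator.
uint8_t count_even_values_v2(std::span<const uint8_t> values)
{
    return count_if_with_accumulator<uint8_t>(values.begin(), values.end(),
                                              [](uint8_t value) { return value % 2 == 0; });
}
```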
Let’s check the assembly of the main loop again:
[Clang assembly listing of the new main loop omitted.]
Great. We went from iterating with DWORD-sized chunks (4 x 8-bit values) in the previous version, to iterating with YMMWORD-sized chunks (32 x 8-bit values) in the current version³. That’s because those partial results packed in the YMM accumulator aren’t filled up with useless zero-padding anymore, so we can use the now-available bit positions to pack 32 partial results, not just 4.
Let’s do some benchmarks⁴ with libbenchmark to see if we’ve actually improved anything significantly.
I am taking 2^10, 2^12, ..., 2^30 random uint8_t values, calling the first version and the second version of the function to get the count of even numbers, and then comparing the results.
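The benchmark code itself isn’t reproduced here; a sketch of how such a setup could look with the library (helper names are illustrative) is:

```cpp
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
#include <vector>

// Generate n pseudo-random uint8_t values with a fixed seed.
static std::vector<uint8_t> make_random_values(std::size_t n)
{
    std::mt19937 gen{42};
    std::uniform_int_distribution<int> dist{0, 255};
    std::vector<uint8_t> values(n);
    for (auto& v : values) v = static_cast<uint8_t>(dist(gen));
    return values;
}

static void bm_v1(benchmark::State& state)
{
    const auto values = make_random_values(static_cast<std::size_t>(state.range(0)));
    for (auto _ : state) {
        benchmark::DoNotOptimize(count_even_values_v1(values));
    }
}

static void bm_v2(benchmark::State& state)
{
    const auto values = make_random_values(static_cast<std::size_t>(state.range(0)));
    for (auto _ : state) {
        benchmark::DoNotOptimize(count_even_values_v2(values));
    }
}

// 2^10, 2^12, ..., 2^30 elements.
BENCHMARK(bm_v1)->RangeMultiplier(4)->Range(1 << 10, 1 << 30);
BENCHMARK(bm_v2)->RangeMultiplier(4)->Range(1 << 10, 1 << 30);
BENCHMARK_MAIN();
```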
Here’s the (lightly) edited benchmark output⁵. Note that the times are in nanoseconds (lower is better), and that v1 represents count_even_values_v1() and v2 represents count_even_values_v2():
[Benchmark results table omitted.]
If we allow loop unrolling, these are the results⁶:
[Benchmark results table (with loop unrolling) omitted.]
So for our workload, the second version wins by a pretty large margin. At best it’s ~9.5 times faster and at worst it’s ~2.0 times faster. This is yet another example of exploiting your input’s constraints in order to make significant performance improvements.
Note: This optimization works out for other element types, too. For example, if we had an array filled with uint32_t values and we had to count the even ones, the fastest solution would be the one that uses uint32_t as the type of the accumulator (not long, nor uint8_t).
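Sketched with the same (hypothetical) helper as before:

```cpp
#include <cstdint>
#include <span>

// Sketch: for uint32_t data, count with a uint32_t accumulator (assuming the
// count is known to fit in 32 bits, or that the result modulo 2^32 is wanted).
uint32_t count_even_values_u32(std::span<const uint32_t> values)
{
    return count_if_with_accumulator<uint32_t>(values.begin(), values.end(),
                                               [](uint32_t value) { return value % 2 == 0; });
}
```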
Bonus: GCC regression
If you’re using a GCC version between 9.1 and 14.2 (the most recent release as of March 8, 2025), there’s an interesting regression that was only fixed after 5 years. The regression is that GCC 8.5 used to be able to vectorize count_even_values_v1(), the version that uses std::count_if(), but later versions up until GCC 15.0 aren’t able to, and instead generate assembly that traverses the array one byte (uint8_t) at a time.
I say ‘interesting regression’ because GCC fails to vectorize the loop specifically for even/odd checks. If you change the condition to x % 3 == 0 or x % 4201337 == 0, the loop is vectorized (although the vectorization is poorer than Clang’s). I reported this issue (#117776) some time ago and it has been fixed on trunk. It won’t be backported, though, so you’ll only get the fix starting with GCC 15.0.
Thanks for reading!
Source code:
- count_even_values_v1(): raw, Godbolt
- count_even_values_v2(): raw, Godbolt
- benchmark.cpp⁷: raw, Godbolt
- Another constraint that would allow the same optimizations would be to say that we don’t know the range of the answer beforehand, but that we want the answer modulo 256. Similar constraints for larger types, e.g. uint16_t or uint32_t, also enable similar optimizations (we could say that we have uint16_t/uint32_t data and we know that the answer is between 0 and 2^16 - 1 or 2^32 - 1, respectively). ↩︎
- If you change the signature of the first version to uint8_t count_even_values_v1(const std::vector<uint8_t>&) (i.e. you return uint8_t instead of auto), Clang is smart enough to basically interpret that as using a uint8_t accumulator in the first place, and thus generates identical assembly to count_even_values_v2(). However, GCC is NOT smart enough to do this, and the signature change has no effect (so it does vectorize the code, but poorly). Generally, I’d rather be explicit and not rely on those implicit/explicit conversions to be recognized and used appropriately by the optimizer. Thanks to @total_order_ for commenting with a Rust solution on Reddit that basically does what I described in this footnote (I’m guessing it comes down to the same LLVM optimization pass). ↩︎
- I’m using std::span for those functions instead of std::vector because I’m only generating one global vector, but this doesn’t affect the measurements. ↩︎