Improving on std::count_if()'s auto-vectorization

The problem


Let’s consider a problem with the following description:

  1. We have an array of arbitrary length filled with uint8_t values;
  2. We want to count the number of even values in the array;
  3. Before doing the calculation we know that the number of even values in the array is between 0 and 255 [1].

A typical solution


To start off, we can leverage the STL for this calculation; more precisely, std::count_if():

auto count_even_values_v1(const std::vector<uint8_t> &vec)
{
    return std::count_if(
        vec.begin(),
        vec.end(),
        [](uint8_t x) { return x % 2 == 0; }
    );
}

Let’s take a look at the assembly (generated by Clang 19.1.0 with flags -O3 -march=rocketlake -fno-unroll-loops). This is the main loop which counts the even numbers:

 1  .LCPI0_1:
 2      .byte   1
 3  count_even_values_v1():
 4  ;   ............
 5      vpbroadcastb    xmm1, byte ptr [rip + .LCPI0_1]
 6  .LBB0_6:
 7      vmovd   xmm2, dword ptr [rsi + rax]
 8      vpandn  xmm2, xmm2, xmm1
 9      vpmovzxbq       ymm2, xmm2
10      vpaddq  ymm0, ymm0, ymm2
11      add     rax, 4
12      cmp     r8, rax
13      jne     .LBB0_6

We can see that the assembly generated by Clang is vectorized, and iterates the array DWORD-by-DWORD (i.e. four 8-bit values at a time).

Let’s inspect the algorithm:

  1. First off, at line 5 in the snippet (vpbroadcastb xmm1, byte ptr [rip + .LCPI0_1]), we fill xmm1 with 16 x 8-bit masks, all equal to 0b1. So xmm1 will look like this:

    xmm1[127:0] := [0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1, 0b1]
    

    Because we’re iterating DWORD-by-DWORD, only the first 32 bits of the xmm2 register hold relevant data on each iteration; the remaining bytes of xmm2 are zeroed by the load, so those lanes contribute nothing and we can disregard them. For that reason, we’ll treat xmm1 as holding only 4 packed 8-bit masks:

    xmm1[31:0] := [0b1, 0b1, 0b1, 0b1]
    
  2. At line 7 (vmovd xmm2, dword ptr [rsi + rax]), we load a DWORD into the first 32 bits of the xmm2 register, i.e. 4 packed 8-bit integers. So xmm2 will look like this (where W, X, Y, Z represent the bit pattern of the first, second, third, and fourth of the loaded integers, respectively):

    xmm2[31:0] := [0bZZZZZZZZ, 0bYYYYYYYY, 0bXXXXXXXX, 0bWWWWWWWW]
    
  3. At line 8 (vpandn xmm2, xmm2, xmm1), we do an AND NOT operation. vpandn DEST, SRC1, SRC2 is defined as DEST := NOT(SRC1) AND SRC2, which means:

    xmm2[31:0] := [~0bZZZZZZZZ & 0b1, ~0bYYYYYYYY & 0b1, ~0bXXXXXXXX & 0b1, ~0bWWWWWWWW & 0b1]
    

    In simpler terms, this effectively selects the lowest (least significant) bit of each packed value and flips it. When a number is even, its lowest bit is zero, so the result of the AND NOT operation is 0b1; similarly, for odd numbers, the result is 0b0.

  4. At line 9 (vpmovzxbq ymm2, xmm2), we zero-extend the four 8-bit results of the AND NOT operation to four equivalent 64-bit results (i.e. for each packed value, we add 56 bits of zero-padding). R1, R2, R3, R4 stand for the first, second, third, and fourth result of the earlier AND NOT operation, respectively:

    ymm2[255:0] := [zero-extended R4, zero-extended R3, zero-extended R2, zero-extended R1]
    
  5. At line 10 (vpaddq ymm0, ymm0, ymm2) we simply add the 4 results from the current iteration (ymm2) to the accumulator (ymm0):

    ymm0[63:0]    := ymm0[63:0]    + ymm2[63:0];
    ymm0[127:64]  := ymm0[127:64]  + ymm2[127:64];
    ymm0[191:128] := ymm0[191:128] + ymm2[191:128];
    ymm0[255:192] := ymm0[255:192] + ymm2[255:192];
    
  6. After the previous step, we either jump back to step 2 if we still have DWORD-sized chunks to process, or we exit the current loop and process the remaining, smaller chunks (not relevant for this investigation, so we won’t go into further detail about those other branches). A scalar sketch of the whole computation follows right after this list.
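Here’s that sketch in plain C++; the function name is hypothetical and the code mirrors the structure of the algorithm, not the exact generated code:

#include <cstddef>
#include <cstdint>

// Scalar sketch of the vectorized algorithm: each element contributes
// ((~x) & 1), i.e. 1 for even values and 0 for odd ones (the vpandn step),
// and the contributions go into four independent 64-bit counters that mirror
// the lanes of ymm0 (the vpmovzxbq + vpaddq steps).
uint64_t count_even_scalar_sketch(const uint8_t *data, size_t n)
{
    uint64_t partial[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)                      // one DWORD (4 bytes) per iteration
        for (size_t lane = 0; lane < 4; ++lane)
            partial[lane] += (~data[i + lane]) & 1; // select the lowest bit and flip it
    uint64_t total = partial[0] + partial[1] + partial[2] + partial[3]; // horizontal sum
    for (; i < n; ++i)                              // leftover elements (the other branches)
        total += (~data[i]) & 1;
    return total;
}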

Concrete example of running the algorithm

Consider we have the array [1, 2, 1, 2, 1, 2, 1, 2] as input, or [0b01, 0b10, 0b01, 0b10, 0b01, 0b10, 0b01, 0b10] in binary.

The individual steps are:

-> vpbroadcastb xmm1, byte ptr [rip + .LCPI0_1]
xmm1[31:0] := [0b1, 0b1, 0b1, 0b1]

Iteration 1:
   -> vmovd xmm2, dword ptr [rsi + rax]
   xmm2[31:0] := [0b10, 0b01, 0b10, 0b01]

   -> vpandn xmm2, xmm2, xmm1
   xmm2[31:0] := [~0b10 & 0b1, ~0b01 & 0b1, ~0b10 & 0b1, ~0b01 & 0b1]
              := [0b1, 0b0, 0b1, 0b0]

   -> vpmovzxbq ymm2, xmm2
   ymm2[255:0] := [0b00...1, 0b00..0, 0b00...1, 0b00..0]

   -> vpaddq  ymm0, ymm0, ymm2
   ymm0[63:0]    := ymm0[63:0]    + ymm2[63:0]    = 0;
   ymm0[127:64]  := ymm0[127:64]  + ymm2[127:64]  = 1;
   ymm0[191:128] := ymm0[191:128] + ymm2[191:128] = 0;
   ymm0[255:192] := ymm0[255:192] + ymm2[255:192] = 1;

Iteration 2:
   -> vmovd xmm2, dword ptr [rsi + rax]
   xmm2[31:0] := [0b10, 0b01, 0b10, 0b01]

   -> vpandn xmm2, xmm2, xmm1
   xmm2[31:0] := [~0b10 & 0b1, ~0b01 & 0b1, ~0b10 & 0b1, ~0b01 & 0b1]
              := [0b1, 0b0, 0b1, 0b0]

   -> vpmovzxbq ymm2, xmm2
   ymm2[255:0] := [0b00...1, 0b00..0, 0b00...1, 0b00..0]

   -> vpaddq  ymm0, ymm0, ymm2
   ymm0[63:0]    := ymm0[63:0]    + ymm2[63:0]    = 0;
   ymm0[127:64]  := ymm0[127:64]  + ymm2[127:64]  = 2;
   ymm0[191:128] := ymm0[191:128] + ymm2[191:128] = 0;
   ymm0[255:192] := ymm0[255:192] + ymm2[255:192] = 2;

So at the end we have two partial results equal to 0, and two partial results equal to 2. Adding them up, we get the final answer 4.


Improving the typical solution


Now, with all of this in mind, you might be wondering: why are we zero-extending those four 8-bit values to four 64-bit values?

The answer is: because the type of the accumulator that std::count_if() uses is the difference_type of the iterator that we’re passing [2]. In this case, the type of the accumulator will be long, a 64-bit integer type (on the platform that we’re targeting). This effectively means that whenever we want to increment the accumulator, we’re telling the compiler that we require 64-bit precision. Otherwise, if the compiler were to generate the increments on 32-bit integers (or smaller), the result might overflow or wrap around, which would lead to a wrong answer at the end of the calculation (so the compiler is not allowed to do that).
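For reference, a typical std::count_if implementation looks roughly like the following simplified sketch (not the exact library code); note that the accumulator is declared with the iterator’s difference_type:

#include <iterator>

// Simplified sketch of a typical std::count_if: the accumulator ('ret') has
// the iterator's difference_type, which on our target is a 64-bit signed
// integer, so every increment is a 64-bit increment as far as the compiler
// is concerned.
template<typename InputIt, typename UnaryPred>
typename std::iterator_traits<InputIt>::difference_type
count_if_sketch(InputIt first, InputIt last, UnaryPred pred)
{
    typename std::iterator_traits<InputIt>::difference_type ret = 0;
    for (; first != last; ++first)
        if (pred(*first))
            ++ret;
    return ret;
}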

However, considering the constraint that was specified in the problem description (the total number of even values in the array is between 0 and 255), we know that we do NOT need 64-bit precision when doing the increments. In our case, 8 bits will suffice. Therefore, the type of the accumulator can be uint8_t, same as the array’s element type.

Let’s then write our own std::count_if() implementation which lets us control the type of the accumulator (regardless of the difference_type of the iterator):

template<typename Acc, typename It, typename Pred>
Acc custom_count_if(It begin, It end, Pred pred)
{
    Acc result = 0;
    for (auto it = begin; it != end; it++)
    {
        if (pred(*it))
            result++;
    }
    return result;
}

We can then use this custom implementation as follows:

auto count_even_values_v2(const std::vector<uint8_t> &vec)
{
    return custom_count_if<uint8_t>(
        vec.begin(),
        vec.end(),
        [](uint8_t x) { return x % 2 == 0; }
    );
}

Let’s check the assembly of the main loop again:

.LCPI0_2:
    .byte   1
count_even_values_v2():
;   ............
    vpbroadcastb    ymm1, byte ptr [rip + .LCPI0_2]
.LBB0_11:
    vmovdqu ymm2, ymmword ptr [rdx + rax]
    vpandn  ymm2, ymm2, ymm1
    vpaddb  ymm0, ymm2, ymm0
    add     rax, 32
    cmp     rdi, rax
    jne     .LBB0_11

Great. We went from iterating with DWORD-sized chunks (4 x 8-bit values) in the previous version to iterating with YMMWORD-sized chunks (32 x 8-bit values) in the current version [3]. That’s because the partial results packed in the YMM accumulator are no longer padded with useless zeros, so we can use the freed-up bit positions to pack 32 partial results instead of just 4.
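In scalar terms, here’s a minimal sketch of what the improved loop is doing (again, the function name is hypothetical and the code mirrors the structure, not the exact generated code):

// Scalar sketch of the improved algorithm: 32 independent uint8_t counters
// (the byte lanes of ymm0), each incremented with ((~x) & 1); because the
// final answer is known to fit in 8 bits, wrapping arithmetic is fine.
uint8_t count_even_scalar_sketch_v2(const uint8_t *data, size_t n)
{
    uint8_t partial[32] = {};
    size_t i = 0;
    for (; i + 32 <= n; i += 32)                     // one YMMWORD (32 bytes) per iteration
        for (size_t lane = 0; lane < 32; ++lane)
            partial[lane] += (~data[i + lane]) & 1;  // vpandn + vpaddb, per byte lane
    uint8_t total = 0;
    for (size_t lane = 0; lane < 32; ++lane)         // horizontal sum of the byte lanes
        total += partial[lane];
    for (; i < n; ++i)                               // leftover elements
        total += (~data[i]) & 1;
    return total;
}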

Let’s do some benchmarks [4] with libbenchmark to see if we’ve actually improved anything significantly.
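For reference, here’s a minimal sketch of how such a harness could be set up with Google Benchmark; the function signatures (taking std::span, per footnote 7), the global data setup, the seed, and the size range are my assumptions, not a verbatim copy of the actual benchmark code:

#include <benchmark/benchmark.h>
#include <cstddef>
#include <cstdint>
#include <random>
#include <span>
#include <vector>

// The two functions from earlier, adapted to take std::span<const uint8_t>
// (footnote 7 mentions the harness uses std::span over one global vector).
long count_even_values_v1(std::span<const uint8_t> vec);
uint8_t count_even_values_v2(std::span<const uint8_t> vec);

// One global vector of random bytes; each benchmark reads a prefix of it.
static const std::vector<uint8_t> g_data = [] {
    std::vector<uint8_t> v(1u << 30);
    std::mt19937 rng{42};
    for (auto &x : v)
        x = static_cast<uint8_t>(rng());
    return v;
}();

static void bm_v1(benchmark::State &state)
{
    const auto n = static_cast<size_t>(state.range(0));
    for (auto _ : state)
        benchmark::DoNotOptimize(count_even_values_v1({g_data.data(), n}));
}

static void bm_v2(benchmark::State &state)
{
    const auto n = static_cast<size_t>(state.range(0));
    for (auto _ : state)
        benchmark::DoNotOptimize(count_even_values_v2({g_data.data(), n}));
}

// 2^10, 2^12, ..., 2^30 elements, matching the tables below.
BENCHMARK(bm_v1)->RangeMultiplier(4)->Range(1 << 10, 1 << 30);
BENCHMARK(bm_v2)->RangeMultiplier(4)->Range(1 << 10, 1 << 30);
BENCHMARK_MAIN();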

I am taking 2^10, 2^12, ..., 2^30 random uint8_t values, calling both versions of the function to get the count of even numbers, and then comparing the results. Here’s the (lightly) edited benchmark output [5]. Note that the times are in nanoseconds (lower is better), and that v1 and v2 stand for count_even_values_v1() and count_even_values_v2(), respectively:

$ clang++ benchmark.cpp -std=c++23 -O3 -fno-unroll-loops -march=native -lbenchmark 
$ ./a.out
Benchmark                   Time_V1      Time_V2     Relative speedup
---------------------------------------------------------------------
[v1 vs. v2]/1024            145          25          5.8x
[v1 vs. v2]/4096            556          77          7.2x
[v1 vs. v2]/16384           2220         287         7.7x
[v1 vs. v2]/65536           8572         1286        6.6x
[v1 vs. v2]/262144          34280        6034        5.6x
[v1 vs. v2]/1048576         137068       25245       5.4x
[v1 vs. v2]/4194304         548051       102163      5.3x
[v1 vs. v2]/16777216        2379866      918532      2.5x
[v1 vs. v2]/67108864        9729229      3804606     2.5x
[v1 vs. v2]/268435456       39623747     15267566    2.5x
[v1 vs. v2]/1073741824      157596298    61126020    2.5x

If we allow loop unrolling, these are the results [6]:

$ clang++ benchmark.cpp -std=c++23 -O3 -march=native -lbenchmark 
$ ./a.out
Benchmark                   Time_V1      Time_V2     Relative speedup
---------------------------------------------------------------------
[v1 vs. v2]/1024            105          12          8.7x
[v1 vs. v2]/4096            425          48          8.8x
[v1 vs. v2]/16384           1703         178         9.5x
[v1 vs. v2]/65536           6804         950         7.1x
[v1 vs. v2]/262144          27233        5421        5.0x
[v1 vs. v2]/1048576         109215       25116       4.3x
[v1 vs. v2]/4194304         436428       101143      4.3x
[v1 vs. v2]/16777216        1900205      926924      2.0x
[v1 vs. v2]/67108864        7836094      3841985     2.0x
[v1 vs. v2]/268435456       31811774     15366099    2.0x
[v1 vs. v2]/1073741824      126397482    61204442    2.0x

So for our workload, the second version wins by a pretty large margin. At best it’s ~9.5 times faster and at worst it’s ~2.0 times faster. This is yet another example of exploiting your input’s constraints in order to make significant performance improvements.

Note: This optimization works out for other element types, too. For example, if we had an array filled with uint32_t values and we had to count the even ones, the fastest solution would be the one that uses uint32_t as the type of the accumulator (not long, nor uint8_t).
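For instance, under the analogous constraint (the count of even values is known to fit into a uint32_t, or is wanted modulo 2^32 as in footnote 1), the 32-bit variant is a straightforward adaptation (the function name is hypothetical):

// Hypothetical 32-bit variant: the accumulator type matches the element type.
auto count_even_values_u32(const std::vector<uint32_t> &vec)
{
    return custom_count_if<uint32_t>(
        vec.begin(),
        vec.end(),
        [](uint32_t x) { return x % 2 == 0; }
    );
}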

Bonus: GCC regression


If you’re using a GCC version between 9.1 and 14.2 (the most recent release as of March 8, 2025), there’s an interesting regression that took 5 years to get fixed. GCC 8.5 used to be able to vectorize count_even_values_v1(), the version that uses std::count_if(), but later versions up to (and not including) GCC 15.0 can’t, and instead generate assembly that traverses the array one byte (uint8_t) at a time.

I say ‘interesting regression’ because GCC fails to vectorize the loop for even/odd checks. If you change the condition to x % 3 == 0 or x % 4201337 == 0, the loop is vectorized (although the vectorization is poorer than Clang’s). I reported this issue (#117776) some time ago and it was fixed in the trunk. It won’t be backported though, so you’ll only have the fix starting with GCC 15.0.
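For illustration, a variant of the predicate that GCC does manage to vectorize looks like this (only the modulus changes compared to count_even_values_v1()):

// Counting multiples of 3 instead of even values: GCC 9.1-14.2 vectorizes
// this std::count_if call, while the x % 2 == 0 version falls back to a
// byte-at-a-time scalar loop.
auto count_multiples_of_3(const std::vector<uint8_t> &vec)
{
    return std::count_if(
        vec.begin(),
        vec.end(),
        [](uint8_t x) { return x % 3 == 0; }
    );
}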

Thanks for reading!

Source code:


  1. Another constraint that would allow the same optimizations would be to say that we don’t know the range of the answer beforehand, but that we want the answer modulo 256. Similar constraints for larger types, e.g. uint16_t or uint32_t, also enable similar optimizations (we could say that we have uint16_t / uint32_t data and we know that the answer is between 0 and 2^16 - 1 or 2^32 - 1, respectively).

  2. std::count_if() page on cppreference

  3. If you change the signature of the first version to uint8_t count_even_values_v1(const std::vector<uint8_t>&) (i.e. you return uint8_t instead of auto), Clang is smart enough to basically interpret that as using a uint8_t accumulator in the first place, and thus generates identical assembly to count_even_values_v2(). However, GCC is NOT smart enough to do this, and the signature change has no effect (so it does vectorize the code, but poorly). Generally, I’d rather be explicit and not rely on those implicit/explicit conversions to be recognized and used appropriately by the optimizer. Thanks to @total_order_ for commenting with a Rust solution on Reddit that basically does what I described in this footnote (I’m guessing it comes down to the same LLVM optimization pass).

  4. CPU specs

  5. Raw output

  6. Raw output with unrolling allowed

  7. I’m using std::span for those functions instead of std::vector because I’m only generating one global vector, but this doesn’t affect the measurements.