Getting rid of unwanted branches with __builtin_unreachable()

Consider that we have an array of 8-bit unsigned integers and we want to calculate their sum modulo 256.

In C++ we can do it like this:

#include <cstddef>
#include <cstdint>
#include <numeric>

uint8_t sum(const uint8_t *data, size_t len)
{
    return std::accumulate(data, data + len, uint8_t(0));
}

When compiling with Clang 19.1.0 and flags -O3 -march=rocketlake -fno-unroll-loops, we get the following assembly:

sum():
        test    rsi, rsi
        je      .LBB0_1
        cmp     rsi, 16
        jae     .LBB0_4
        xor     eax, eax
        mov     rdx, rdi
        jmp     .LBB0_15
.LBB0_1:
        xor     eax, eax
        ret
.LBB0_4:
        cmp     rsi, 32
        jae     .LBB0_10
        xor     eax, eax
        xor     ecx, ecx
.LBB0_6:
        mov     r8, rsi
        and     r8, -16
        lea     rdx, [rdi + r8]
        movzx   eax, al
        vmovd   xmm0, eax
.LBB0_7:
        vpaddb  xmm0, xmm0, xmmword ptr [rdi + rcx]
        add     rcx, 16
        cmp     r8, rcx
        jne     .LBB0_7
        vpshufd xmm1, xmm0, 238
        vpaddb  xmm0, xmm0, xmm1
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        cmp     r8, rsi
        jne     .LBB0_15
        jmp     .LBB0_9
.LBB0_10:
        mov     rcx, rsi
        and     rcx, -32
        vpxor   xmm0, xmm0, xmm0
        xor     eax, eax
.LBB0_11:
        vpaddb  ymm0, ymm0, ymmword ptr [rdi + rax]
        add     rax, 32
        cmp     rcx, rax
        jne     .LBB0_11
        vextracti128    xmm1, ymm0, 1
        vpaddb  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238
        vpaddb  xmm0, xmm0, xmm1
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        cmp     rcx, rsi
        je      .LBB0_9
        test    sil, 16
        jne     .LBB0_6
        add     rcx, rdi
        mov     rdx, rcx
.LBB0_15:
        add     rdi, rsi
.LBB0_16:
        add     al, byte ptr [rdx]
        inc     rdx
        cmp     rdx, rdi
        jne     .LBB0_16
.LBB0_9:
        vzeroupper
        ret

The code is nicely vectorized, but that’s quite a lot of assembly (68 lines).

Looking more closely at it, we can see that the ‘main’ loop is this one, which works on YMMWORD-sized chunks (256 bits’ worth of data or, in this case, 32 x 8-bit integers):

.LBB0_11:
        vpaddb  ymm0, ymm0, ymmword ptr [rdi + rax]
        add     rax, 32
        cmp     rcx, rax
        jne     .LBB0_11

There’s also this part, which adds up an XMMWORD-sized chunk if len % 32 >= 16 [1].

.LBB0_7:
        vpaddb  xmm0, xmm0, xmmword ptr [rdi + rcx]
        add     rcx, 16
        cmp     r8, rcx
        jne     .LBB0_7

This part deals with the elements that remain after processing all YMMWORD and XMMWORD chunks (i.e. it adds up the remaining len % 16 integers):

.LBB0_16:
        add     al, byte ptr [rdx]
        inc     rdx
        cmp     rdx, rdi
        jne     .LBB0_16

And last but not least, there’s also this branch, which returns zero immediately whenever we pass len == 0 to the function:

sum():
        test    rsi, rsi
        je      .LBB0_1
;       ................
.LBB0_1:
        xor     eax, eax
        ret

If we care about vectorization and minimizing code size (for example, because we might want to do aggressive inlining while keeping important things relatively close together in the instruction cache), then we can significantly reduce the assembly by imposing or exploiting a couple of constraints on our input. For example, let’s assume that:

  1. The length of the array that we want to sum up is always a multiple of 32 (the number of 8-bit values that fit in a YMM register). This constraint can be a natural property of our input, or we can artificially add zero values as padding until we reach a length that is a multiple of 32 (see the sketch after this list).
  2. The array is non-empty. By convention, we can say that the sum function must NOT be called on empty arrays.
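
If the first constraint isn’t a natural property of the input, the zero padding can be done with a small helper along these lines. This is only a minimal sketch: pad_to_ymmword_multiple is a made-up name, and it leaves an empty array empty, so the non-empty constraint still has to be guaranteed separately.

#include <cstddef>
#include <cstdint>
#include <vector>

// Append zero bytes until the length is a multiple of 32.
// Zeros do not change the sum modulo 256.
void pad_to_ymmword_multiple(std::vector<uint8_t> &data)
{
    constexpr size_t U8_VALUES_PER_YMMWORD = 32;
    const size_t remainder = data.size() % U8_VALUES_PER_YMMWORD;
    if (remainder != 0)
        data.resize(data.size() + (U8_VALUES_PER_YMMWORD - remainder), 0);
}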

We can express these constraints with __builtin_unreachable() [2], a compiler built-in that tells the compiler that a specific code location/code path is unreachable:

uint8_t sum_with_constraints(const uint8_t *data, size_t len)
{
    constexpr size_t U8_VALUES_PER_YMMWORD = 32;
    if (len % U8_VALUES_PER_YMMWORD != 0)
        __builtin_unreachable(); // `len` is always a multiple of 32.
    if (len == 0)
        __builtin_unreachable(); // `len` is never zero.

    return std::accumulate(data, data + len, uint8_t(0));
}

Let’s see where this gets us, with the same compiler/flags as before:

sum_with_constraints():
        vpxor   xmm0, xmm0, xmm0
        xor     eax, eax
.LBB0_1:
        vpaddb  ymm0, ymm0, ymmword ptr [rdi + rax]
        add     rax, 32
        cmp     rsi, rax
        jne     .LBB0_1
        vextracti128    xmm1, ymm0, 1
        vpaddb  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238
        vpaddb  xmm0, xmm0, xmm1
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        vzeroupper
        ret

Pretty good. That’s the entire output. We’re down to 17 lines, with very little effort. In terms of machine code size, we’re down from 222 bytes (for the initial, unconstrained version) to 65 bytes (for the version above). This is because the compiler now knows that the branches dealing with empty arrays, with a leftover XMMWORD chunk, or with single trailing elements are redundant. Since we indicated that the entire array can be traversed in YMMWORD-sized chunks, the compiler drops all the other branches from the first version.

This may be a small and simple example, but generally speaking, you may be surprised by how much performance or code size improvement you can squeeze out simply by exploiting your input’s constraints. And these things eventually add up.

If you’ve identified a bottleneck, try reasoning about whether your input has any exploitable properties. For example, if you’re doing a lot of work parsing integers with std::stoi() [3], but your integers are actually between 0 and 100, try writing a custom parsing function that uses this information and see if it improves the speed. Or, if you’re decoding JSON data that necessarily adheres to a certain schema, try NOT decoding it as arbitrary JSON, because you might end up paying a lot for various validations or for storing metadata [4] for abstract JSON types, even though you already know your actual types before decoding.
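
As a rough illustration of the first suggestion, a constrained parser might look something like this. parse_0_to_100 is a made-up name, and the sketch assumes the input is already known to be a valid, unsigned decimal number in [0, 100], so sign handling, error reporting and overflow checks are skipped entirely.

#include <cstdint>
#include <string_view>

// Hypothetical parser for integers known to be in [0, 100]:
// at most three digits, no sign, no overflow or error handling needed.
uint8_t parse_0_to_100(std::string_view s)
{
    uint8_t value = 0;
    for (char c : s)
        value = static_cast<uint8_t>(value * 10 + (c - '0'));
    return value;
}

Because the range and the format are known up front, all of the general-purpose machinery (base handling, whitespace skipping, error signalling) disappears, which is typically where a generic parser spends its extra time.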

What about GCC?


Oftentimes this kind of optimization won’t translate well (or at all) when switching from one compiler to another.

In this case GCC does optimize the sum_with_constraints() version as expected (i.e. it eliminates all branches except for the YMMWORD loop).

However, when we try using a std::vector instead of a raw pointer and a length, things get messy:

uint8_t sum_vec_with_constraints(const std::vector<uint8_t> &vec)
{
    constexpr size_t U8_VALUES_PER_YMMWORD = 32;
    if (vec.size() % U8_VALUES_PER_YMMWORD != 0)
        __builtin_unreachable(); // `vec.size()` is always a multiple of 32.
    if (vec.size() == 0)
        __builtin_unreachable(); // `vec` is never empty.

    return std::accumulate(vec.begin(), vec.end(), uint8_t(0));
}

The good news is that Clang is still capable of optimizing this, just like sum_with_constraints(). The bad news is that GCC isn’t capable. In this case, __builtin_unreachable() is ineffective (the same assembly gets generated when we omit those __builtin_unreachable() calls altogether).

We can try making the sum_vec_with_constraints() function use sum_with_constraints(), like this:

__always_inline
uint8_t sum_with_constraints(const uint8_t *data, size_t len)
{
    constexpr size_t U8_VALUES_PER_YMMWORD = 32;
    if (len % U8_VALUES_PER_YMMWORD != 0)
        __builtin_unreachable(); // `len` is always a multiple of 32.
    if (len == 0)
        __builtin_unreachable(); // `len` is never zero.

    return std::accumulate(data, data + len, uint8_t(0));
}

uint8_t sum_vec_with_constraints_v2(const std::vector<uint8_t> &vec)
{
    const uint8_t *data = vec.begin().base();
    const size_t   len  = vec.size();
    return sum_with_constraints(data, len);
}

But even this does not work. I tried a few other ways of helping the compiler figure this out, but none of them were successful. If you give it a try and find something that works, let me know (see the update in [5])!

I opened an issue on GCC’s tracker about this problem, and it seems to be related to how expressions get optimized away in certain contexts: we guard __builtin_unreachable() with a constraint on a certain expression, but after a few optimization passes that expression is no longer used in the actual code, from the compiler’s perspective. Andrew Pinski described the problem more concretely:

This is because len is vec.end() - vec.begin() where end() is a pointer load.
And then data + len gets re-optimized to just vec.end().
And so len is no longer taken into account in the std::accumulate loop.

This goes to show how compiler- and context-dependent this kind of optimization can be.

I appreciate the GCC team’s quick responses to these reports. Hopefully they find a solution for this issue.
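
Based on the description of the workarounds in the update [5] (using asm volatile to disconnect len from vec.end() - vec.begin()), a sketch of that approach might look roughly like this. sum_vec_with_constraints_v3 is a made-up name, it reuses sum_with_constraints() from above, and the empty asm statement merely acts as an optimization barrier:

#include <cstdint>
#include <vector>

uint8_t sum_vec_with_constraints_v3(const std::vector<uint8_t> &vec)
{
    const uint8_t *data = vec.data();
    size_t len = vec.size();
    // Optimization barrier: the compiler can no longer re-derive `len`
    // from vec.end() - vec.begin(), so the constraints expressed inside
    // sum_with_constraints() should survive into the std::accumulate() loop.
    asm volatile("" : "+r"(len));
    return sum_with_constraints(data, len);
}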

Source code:


  1. It only adds up 1 XMMWORD-sized chunk, because 2 or more would constitute at least a YMMWORD chunk (and those have already been processed at this point).

  2. If you prefer not using the compiler built-in, C23 has unreachable(), and C++23 has std::unreachable() (a short sketch follows these notes).

  3. Or the function in your language of choice that parses arbitrary 32-bit integers. This advice is not language-specific.

  4. E.g. union tags, array pointers/lengths. If, according to some pre-defined schema, you have a million JSON arrays filled with a single integer element each, you can end up paying more for the internal array representations in your language of choice than for the actual array elements. Instead, you could write a custom JSON decoder that only allocates a single array with a million elements’ worth of capacity and appends each integer to it while decoding the JSON; this avoids wasting a lot of space on the internal representations of one million arrays, assuming that’s how your non-schema-based JSON decoder would do it.

  5. @namniav commented on Twitter with a few solutions. You can use asm volatile, std::launder() [6], or a volatile qualifier to basically make the compiler disconnect len from vec.end() - vec.begin(). After that, the __builtin_unreachable() requirements will still hold in the std::accumulate() loop. The asm volatile approach produces the shortest assembly; the other ones have a bit of overhead. See the solutions here.

  6. Who even chooses these names?
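
As mentioned in [2], the same hint can be expressed with the standard facility instead of the compiler built-in. A variant of sum_with_constraints() using C++23’s std::unreachable(), which should behave the same way, might look roughly like this (sum_with_constraints_std is a made-up name; everything else is unchanged):

#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility> // std::unreachable() (C++23)

uint8_t sum_with_constraints_std(const uint8_t *data, size_t len)
{
    constexpr size_t U8_VALUES_PER_YMMWORD = 32;
    if (len % U8_VALUES_PER_YMMWORD != 0)
        std::unreachable(); // `len` is always a multiple of 32.
    if (len == 0)
        std::unreachable(); // `len` is never zero.

    return std::accumulate(data, data + len, uint8_t(0));
}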