A Clang regression related to switch statements and inlining

After my previous post, Eliminating redundant bound checks (read it for context if you haven't already), I wanted to benchmark the 'optimized' version of the increment() function, which contained no bound checks when compiled with Clang even though we used .at() for indexing into the array. Here's that final version of the function (I made it static because I'm using a single compilation unit):

static void increment(std::array<uint8_t, 1024> &data, uint8_t first_idx)
{
    size_t second_idx = first_idx_to_second_idx(first_idx);
    data.at(second_idx)++;
}
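
For reference, first_idx_to_second_idx() is the big switch statement from the previous post that maps each first_idx to an index into data. Its exact contents don't matter here; a minimal sketch of its shape (the two cases shown match the assembly comments later in this post, the rest is made up) looks like this:

// Sketch only: the real function in the previous post has one case per
// possible first_idx value, and its default may differ.
static size_t first_idx_to_second_idx(uint8_t first_idx)
{
    switch (first_idx) {
    case 1:  return 553;
    case 2:  return 541;
    // ... one case per possible first_idx value ...
    default: return 0;
    }
}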

My benchmark goes like this:

int main()
{
    std::vector<uint8_t> indexes = generate(); // Generate 300 million indexes.

    auto t0 = std::chrono::high_resolution_clock::now();

    std::array<uint8_t, 1024> arr = {};
    for (auto idx : indexes)
        increment(arr, idx);

    auto t1 = std::chrono::high_resolution_clock::now();

    // Use the results from increment() so the function call is not optimized away.
    auto sum = std::accumulate(arr.begin(), arr.end(), 0);

    // Print the elapsed time and the sum.
    auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
    std::println("elapsed: {} sum: {}", dur, sum);
}
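
generate() isn't particularly interesting: it just fills a vector with 300 million random bytes. Something along these lines would do (a sketch; the exact RNG doesn't matter, as long as the indexes are uniformly random):

// Sketch of generate(); needs <cstdint>, <random> and <vector>.
static std::vector<uint8_t> generate()
{
    std::vector<uint8_t> indexes(300'000'000);
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 255); // uint8_t isn't a valid distribution type
    for (auto &idx : indexes)
        idx = static_cast<uint8_t>(dist(rng));
    return indexes;
}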

I generate 300 million first_idx values and then call increment() with each of them. To keep inlining from skewing the results, I also benchmarked a version of increment() that is not allowed to be inlined:

__attribute__((noinline))
static void increment_non_inlined(std::array<uint8_t, 1024> &data, uint8_t first_idx)
{
    size_t second_idx = first_idx_to_second_idx(first_idx);
    data.at(second_idx)++;
}

The results were as follows:

$ clang++ inlined.cpp -stdlib=libc++ -std=c++23 -O3 -march=native
$ ./a.out
elapsed: 2810ms sum: 27392
$ clang++ non-inlined.cpp -stdlib=libc++ -std=c++23 -O3 -march=native
$ ./a.out
elapsed: 547ms sum: 29696

That's rather… unexpected. Allowing the function to be inlined, and thus giving the compiler more context about how it's called, makes the performance worse, not better. And not just by a little: it's roughly 5 times slower.

What's causing the issue? Well, the behavior of the increment() function changes drastically when it's inlined compared to when it isn't. Let's look at the assembly.

The non-inlined version does what we saw in the previous post: it obtains second_idx by indexing into a lookup table [1] with first_idx, and then it increments the byte at that position.

increment:
        add     sil, -128
        movzx   eax, sil
      # size_t second_idx = first_idx_to_second_idx(first_idx);
        lea     rcx, [rip + .Lswitch.table.increment]
        mov     rax, qword ptr [rcx + 8*rax]
      # data.at(second_idx)++;
        inc     byte ptr [rdi + rax]
        ret
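
In C++ terms, this is roughly what the compiler turned the non-inlined function into: the whole switch becomes a data table (.Lswitch.table.increment), and picking second_idx is a single branch-free load. A sketch of the equivalent source, ignoring the small index adjustment in the first two instructions:

// Rough C++ equivalent of the non-inlined code path (a sketch, not actual source).
static const size_t switch_table[256] = { /* second_idx for every possible first_idx */ };

static void increment_as_compiled(std::array<uint8_t, 1024> &data, uint8_t first_idx)
{
    size_t second_idx = switch_table[first_idx]; // one load from the lookup table
    data[second_idx]++;                          // one increment, no branches
}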

On the other hand, the inlined version does something very bad for performance. Every case of the switch gets its own label, and each one feeds into a common block that does the actual increment (.LBB0_266):

.LBB0_11:
        lea     rcx, [rsp + 561] # case 1: return 553;
        jmp     .LBB0_266
.LBB0_12:
        lea     rcx, [rsp + 549] # case 2: return 541;
        jmp     .LBB0_266
#       ...
.LBB0_266:
        inc     byte ptr [rcx] # data.at(second_idx)++;
        inc     r12
        cmp     r12, 268435456
        je      .LBB0_8
        movzx   ecx, byte ptr [rbx + r12] # Get the next first_idx.
        movsxd  rdx, dword ptr [rax + 4*rcx] # Get the next switch case label.
        add     rdx, rax
        mov     rcx, rbp
        jmp     rdx # Jump to the switch case label of the current first_idx value.
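
In C++-ish terms, the inlined loop looks roughly like this: the switch stays inside the hot loop, and every single element dispatches through the jump table with an indirect jump (a sketch of the control flow above, not actual compiler output):

for (auto idx : indexes) {
    uint8_t *slot;
    switch (idx) {                     // indirect jump through the jump table, once per element
    case 1: slot = &arr[553]; break;   // .LBB0_11
    case 2: slot = &arr[541]; break;   // .LBB0_12
    // ... one label per case ...
    default: slot = &arr[0]; break;    // placeholder; the real default depends on the switch
    }
    ++*slot;                           // .LBB0_266
}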

Since the indexes are randomly generated, the branch predictor won't be able to accurately predict which of those switch case labels to jump to on each iteration. If the RNG we use is good, we'd expect the branch predictor to miss on almost every iteration. We can verify this with perf stat:

$ sudo perf stat ./a.out # increment()
elapsed: 2815ms sum: 27136

 Performance counter stats for './a.out':
        ...
        4809459795      branches      # 542.294 M/sec
         299087225      branch-misses # 6.22% of all branches

$ sudo perf stat ./a.out # increment_non_inlined()
elapsed: 546ms sum: 27648

 Performance counter stats for './a.out':
        ...
        4804843199      branches      # 735.653 M/sec
             79808      branch-misses # 0.00% of all branches

So the inlined version generates 299087225 branch misses, while the non-inlined version generates 79808. The number 299087225 is pretty close to the total number of first_idx values that we generate (300 million), so the expectation seems to be correct [2]. No wonder the performance tanks so hard.

Next, I wanted to check whether this botched optimization could be fixed by passing some flags to Clang. The zig toolchain has some defaults that have given me better performance in the past when I simply replaced clang++ with zig c++ in the compilation command. Testing the inlined version that performed badly with clang++:

$ zig c++ inlined.cpp -std=c++23 -O3 -march=native && ./a.out
elapsed: 243ms sum: 28928

That did it: it's now roughly 11 times faster than the binary we compiled with Clang.

I have Zig 0.13 locally, which uses Clang 18.1.6. My previous clang++ commands were using version 19.1.6. After adding some of the flags that the zig c++ toolchain sets to the previous clang++ command without managing to improve the performance, I started suspecting that this may simply be a regression in Clang 19. Testing with Clang 18.1.8:

$ /tmp/clang-18/bin/clang++ inlined.cpp -stdlib=libc++ -std=c++23 -O3 -march=native
$ ./a.out
elapsed: 239ms sum: 27648

So this does indeed seem to be a regression (at the very least, for this use case). I opened an issue on Clang/LLVM's repo; you can find it here if you're interested in how this turns out. Someone has already responded saying that it's a trade-off between two optimization passes: SROA (scalar replacement of aggregates) and SimplifyCFG (control-flow graph simplification). It remains to be seen whether this was intentional and whether they want to do something about it going forward.

You can see the benchmark code here, or play around with a smaller reproduction on Godbolt.


  [1] Note: this lookup table is not related to the TABLE variable that we used in the first version of the increment() function in the previous post. This lookup table is the assembly representation that Clang generates for the big switch statement.

  [2] I also ran the same code but with 100 million first_idx values instead of 300 million, just to verify that those ~300 million branch misses we saw are directly related to the number of increment() invocations, and not to something else. In this case the number of branch misses was roughly 100 million, which further confirms the original expectation.