After my previous post, Eliminating redundant bound checks (read it for context if you haven’t already), I wanted to do a benchmark using the ‘optimized’ version of the increment() function, which didn’t contain any bound checks when compiled with Clang, even though we used .at() for indexing into the array. Here’s that last version of the function from the previous post (I made it static because I’m using a single compilation unit).
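In sketch form, it looks roughly like this (the case values below are placeholders rather than the real mapping; the property that matters is that the switch covers every possible first_idx and always produces an in-range second_idx, which is what lets Clang drop the bound check inside .at()):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of the function's shape: the real version has one case per possible
// first_idx value, and the mapping below is made up for illustration.
static std::array<std::uint64_t, 4> counters{};

static void increment(std::size_t first_idx) {
    std::size_t second_idx = 0;
    switch (first_idx) {
        case 0: second_idx = 3; break;
        case 1: second_idx = 1; break;
        case 2: second_idx = 2; break;
        case 3: second_idx = 0; break;
        default: return; // any other first_idx: do nothing
    }
    counters.at(second_idx) += 1; // .at(), yet no bound check in the generated code
}
```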
My benchmark goes like this:
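The harness has roughly this shape (a sketch that builds on the increment() sketch above; the RNG choice and timing details are assumptions, not the exact benchmark code):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t COUNT = 300'000'000;

    // Pre-generate the random first_idx values so the RNG cost stays out of
    // the measured loop.
    std::mt19937_64 rng{42};
    std::uniform_int_distribution<std::size_t> dist{0, counters.size() - 1};
    std::vector<std::size_t> first_idxes(COUNT);
    for (auto& idx : first_idxes) {
        idx = dist(rng);
    }

    const auto start = std::chrono::steady_clock::now();
    for (const auto idx : first_idxes) {
        increment(idx);
    }
    const auto end = std::chrono::steady_clock::now();

    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    // Print a counter so the loop can't be optimized away entirely.
    std::printf("%lld ms, counters[0] = %llu\n",
                static_cast<long long>(ms),
                static_cast<unsigned long long>(counters[0]));
}
```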
I generate 300 million instances of first_idx, and then I call increment() with them. In order to avoid inlining messing with the results, I also benchmarked a version of increment() that is not inlined:
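Presumably something along these lines, with Clang’s noinline attribute on a separate copy of the function (a sketch; the wrapper name and the exact attribute are placeholders):

```cpp
// Same logic as increment(), but inlining into callers is forbidden, so the
// benchmark loop gets no extra context about first_idx to specialize on.
[[gnu::noinline]] static void increment_no_inline(std::size_t first_idx) {
    increment(first_idx);
}
```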
The results were rather… unexpected. Allowing the function to be inlined, and thus giving it more context regarding the way it’s called, makes the performance worse, not better. And not just by a little: it’s roughly 5 times slower.
What’s causing the issue? Well, the behavior of the increment() function changes vastly when it’s inlined compared to when it isn’t. Let’s look at the assembly.
The non-inlined version does what we saw in the previous post. It gets the second_idx by indexing into a lookup table [1] with first_idx, and then it increments the integer at that position.
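In C++ terms, the effective logic is roughly the following (a sketch, not the actual assembly; the table contents stand in for whatever mapping the big switch encodes):

```cpp
// Effective logic of the non-inlined code: the switch is lowered to a constant
// table in the binary plus one indexed load, with no per-case branching.
static constexpr std::uint8_t switch_table[4] = {3, 1, 2, 0}; // placeholder values

static void increment_lowered(std::size_t first_idx) {
    if (first_idx >= 4) {
        return; // the switch's default case
    }
    counters[switch_table[first_idx]] += 1;
}
```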
On the other hand, the inlined version does something very bad for performance. Every case in the switch has a different branch (or label), and each one feeds into a main branch that does the actual increment (.LBB0_266).
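In C++ terms, the effective logic of the inlined path looks more like this (again a sketch, not the actual assembly):

```cpp
// Effective logic of the inlined code: a chain of per-value branches instead
// of one table load. With random first_idx values, a different branch is
// taken almost every iteration, which is hostile to the branch predictor.
static void increment_inlined_shape(std::size_t first_idx) {
    std::size_t second_idx;
    if (first_idx == 0) {
        second_idx = 3; // placeholder mapping, as above
    } else if (first_idx == 1) {
        second_idx = 1;
    } else if (first_idx == 2) {
        second_idx = 2;
    } else if (first_idx == 3) {
        second_idx = 0;
    } else {
        return;
    }
    counters[second_idx] += 1; // the shared "main branch" (.LBB0_266)
}
```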
Since the indexes are randomly generated, the branch predictor won’t be able to accurately predict which of those switch case labels to jump to on each iteration. If the RNG that we use is good, we’d expect the branch predictor to miss on almost every iteration. We can verify this with perf stat:
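For example (the binary names are placeholders):

```sh
# Count branches and branch misses for each version of the benchmark.
perf stat -e branches,branch-misses ./bench_inlined
perf stat -e branches,branch-misses ./bench_noinline
```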
The inlined version generates 299087225 branch misses, while the non-inlined version generates 79808. The number 299087225 is pretty close to the total number of first_idx’es that we generate (300 million), so the expectation seems to be correct [2]. No wonder the performance tanks so hard.
Next, I wanted to check if this botched optimization could be fixed by some flags given to Clang. The zig toolchain has some defaults which have given me better performance in the past when simply replacing clang++ with zig c++ in the compilation command. Testing the inlined version, which performed badly with clang++:
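The command is the same as before with the compiler driver swapped out, roughly (the flags and file names here are assumptions):

```sh
# Build the same single-file benchmark with zig's bundled Clang and run it.
zig c++ -O3 bench.cpp -o bench_zig
./bench_zig
```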
That did it: it’s now 11 times faster than what we compiled with Clang.
I have Zig 0.13 locally, which uses Clang 18.1.6. My previous clang++ commands were using version 19.1.6. After trying to add some flags (that the zig c++ toolchain sets) to the previous clang++ command without managing to improve the performance, I started thinking that this might just be a regression in Clang 19. Testing with Clang 18.1.8:
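Roughly (the versioned binary name is an assumption):

```sh
# Rebuild the inlined benchmark with Clang 18.1.8 and rerun it.
clang++-18 -O3 bench.cpp -o bench_clang18
./bench_clang18
```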
Clang 18.1.8 also produces the fast version, so this does indeed seem to be a regression (at the very least, for the purposes of this use case). I opened an issue on Clang/LLVM’s repo; you can find it here if you’re interested in this problem’s conclusion. Someone already responded saying that it’s a trade-off between two optimization passes: SROA (scalar replacement of aggregates) and SimplifyCFG (simplify control flow graph). It remains to be seen whether this was intentional and/or whether they want to do something about it going forward.
You can see the benchmark code here, or play around with a smaller reproduced version on Godbolt.
1. Note: this lookup table is not related to the TABLE variable that we used in the first version of the increment() function in the previous post. This lookup table is the assembly representation that Clang generates for the big switch statement. ↩︎

2. I also ran the same code but with 100 million first_idx’es instead of 300 million, just to verify that those ~300 million branch misses we saw are directly related to the number of increment() invocations, and not to something else. In this case the number of branch misses was roughly 100 million, which further confirms the original expectation. ↩︎