A couple of days ago I was thinking about what you can do when the branch predictor is effectively working against you, and thus pessimizing your program instead of optimizing it.
Let’s work with something relatively simple & concrete: consider that we want to write some kind of financial system (maybe a trading system) and all of our transaction requests arrive at a certain function before being either (a) sent out to some authoritative server, or (b) abandoned. Let’s also assume that:
- The vast majority of our transaction requests end up being abandoned at the last step.
- We care A LOT about the speed of the ‘send’ path and we want to make it as fast as possible.
- We don’t care at all about the speed of the ‘abandon’ path.
The code would look roughly like this:
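```cpp
// Sketch: the Transaction type and exact signatures here are assumptions;
// only the function names matter for the discussion below.
struct Transaction { /* ... */ };

bool should_send(const Transaction& t);  // true only for the rare 'send' case
void send(const Transaction& t);         // hot path: must be as fast as possible
void abandon(const Transaction& t);      // cold path: speed is irrelevant

void resolve(const Transaction& t) {
    if (should_send(t)) {
        send(t);     // (a) forward to the authoritative server
    } else {
        abandon(t);  // (b) drop the request
    }
}
```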
The implication of the first assumption is that the branch predictor will be heavily primed to predict that `should_send(t)` returns `false`. What this means is that when we finally get a transaction that needs to be sent out, we’ll likely have to pay a branch misprediction penalty (~20 cycles); plus, our `send()` function may not be in the instruction cache just yet, because it’s so rarely executed. We also won’t get the advantage of pipelining whatever instructions are needed for `send()`.
Since we only care about the speed of the `send()` path, I was wondering if there’s a way of telling the CPU that we don’t actually want to rely on the branch predictor at all when choosing between `send()` and `abandon()`: we always want to assume that `send()` will be executed.
A low-level solution?
I asked Claude if there is such a way to basically hard-code branch prediction rules into the machine code, and the answer was that there’s no way to do this on x86, but there is a way on ARM: the `BEQP` (predict branch taken) and `BEQNP` (predict branch not taken) instructions.
Those ARM instructions are just hallucinated, and the reality is actually the other way around: ARM doesn’t have a way of hard-coding ‘predictions’, but x86 does. More accurately: some old x86 processors do. On the Pentium 4 series, those rules/hints are encoded as instruction prefixes: `0x2E` (branch not taken) and `0x3E` (branch taken).
So if a jump instruction came in with the prefix `0x3E`, the processor would assume that the branch is taken.
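For illustration, hand-emitting the prefix would look something like this (my own sketch in NASM syntax; `resolve_hinted` and the convention of keeping the flag in `al` are assumptions):

```nasm
; Hypothetical sketch: emit the raw 0x3E prefix byte in front of a
; conditional jump by hand, since no compiler will do it for us.
global resolve_hinted
extern send, abandon

resolve_hinted:                 ; assumes al holds should_send(t)'s result
        test    al, al
        db      0x3e            ; 'branch taken' hint prefix on the Pentium 4
        jne     .send           ; the hinted conditional jump
        jmp     abandon         ; cold path
.send:
        jmp     send            ; hot path, statically hinted as taken
```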
On modern x86 processors, those instruction prefixes are simply ignored, so compilers obviously won’t even bother generating them when you’re targeting such a CPU.
Another ‘low-level’ approach that doesn’t work here is the `[[likely]]` and `[[unlikely]]` attributes introduced in C++20.
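Using them on the `resolve()` sketch from above would look roughly like this:

```cpp
// The hypothetical resolve() from above, annotated with C++20's [[likely]].
void resolve(const Transaction& t) {
    if (should_send(t)) [[likely]] {  // hint: treat this branch as the common one
        send(t);
    } else {
        abandon(t);
    }
}
```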
Those attributes will typically just reorder some labels/paths around to make sure that the path which we mark as `[[likely]]` requires fewer jumps. Whether we mark the `send()` branch as `[[likely]]` or we leave the code as-is, Clang and GCC generate the same assembly[^1]:
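```asm
; Representative -O2 output (labels and registers will vary by
; compiler and version):
resolve(Transaction&):
        push    rbx
        mov     rbx, rdi
        call    should_send(Transaction const&)
        test    al, al
        je      .L_abandon
        mov     rdi, rbx
        pop     rbx
        jmp     send(Transaction const&)      ; hot path: fall-through, tail call
.L_abandon:
        mov     rdi, rbx
        pop     rbx
        jmp     abandon(Transaction const&)   ; cold path: taken jump
```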
It won’t really matter that the `send()` branch is the one that falls through without taking the conditional jump, because the branch predictor will be primed to assume the `abandon()` branch nonetheless. If you’re writing code for a modern x86 processor, you can consider that `[[likely]]` and `[[unlikely]]` are completely unrelated to the CPU branch predictor, because the microarchitecture itself won’t expose any way of overriding its predictions. So those attributes are just a reordering mechanism. As noted in the proposal doc, however, on some architectures the compiler could use the likely/unlikely hints to override the branch predictor:
> Some of the possible code generation improvements from using branch probability hints include:
>
> - […]
> - Some microarchitectures, such as Power, have branch hints which can override dynamic branch prediction.
A higher-level solution?
So if we can’t bypass the branch predictor on modern x86 processors with some fine-grained and guaranteed mechanism, and we also don’t want to go buy a bunch of Pentium 4 CPUs and run our code on them, we need to think about a higher-level solution that works well enough; it doesn’t need to be guaranteed to work 100% of the time.
One such solution that I know of is one that Carl Cook talked about during his CppCon 2017 talk[^2]: we can fill our system with mocked transaction data for which `should_send(t)` returns `true`. We have to do this enough times that the mocked transaction data becomes the vast majority over the real data, so the branch predictor will be primed to assume that the `send()` path will be executed on practically every `resolve()` call. Overall, this may actually be a better approach than hard-coding prediction rules, because those rules wouldn’t really guarantee us that the whole `send()` path would get executed (the assumption may just lead to partial execution, until the `should_send(t)` condition is actually evaluated and the pipeline is flushed; so at the end we may still have important stuff not placed in the instruction/data cache).
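A minimal sketch of the idea, assuming a hypothetical `is_warmup` flag and event loop (the real system described in the talk is more involved):

```cpp
#include <optional>

struct Transaction {
    bool is_warmup = false;  // marks a dummy used only to keep the hot path warm
    // ... real fields ...
};

bool should_send(const Transaction& t);  // crafted to return true for warmups
void send(const Transaction& t);         // the NIC later recognizes & drops warmups
void abandon(const Transaction& t);

void resolve(const Transaction& t) {
    if (should_send(t)) {
        send(t);  // stays hot: the predictor, i-cache and d-cache keep seeing it
    } else {
        abandon(t);
    }
}

// When no real transaction is pending, push a dummy through the exact same path.
void poll_once(const std::optional<Transaction>& incoming) {
    if (incoming) {
        resolve(*incoming);
    } else {
        Transaction warmup;
        warmup.is_warmup = true;
        resolve(warmup);  // primes send() for the next real transaction
    }
}
```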
For getting rid of those mocked transactions after they get ‘sent out’ from our program, Carl Cook mentions that network cards are capable of identifying and discarding such messages without adding significant overhead to the real transactions, so we can just leave it at that.
As mentioned in his talk, this whole ‘mocked/dummy transaction’ system gives him a 5 microsecond speed-up which, as the talk’s title suggests, matters a lot for his use case.
Thanks for reading!
You can check out the r/cpp discussion regarding this blog post here.
References:
- Carl Cook’s CppCon talk (see footnote);
- CppCon 2022: C++20’s [[likely]] Attribute - Optimizations, Pessimizations, and [[unlikely]] Consequences, page 33;
- Answer to “Is there a compiler hint for GCC to force branch prediction to always go a certain way?” on Stack Overflow.
[^1]: This is not to say that `[[likely]]` and `[[unlikely]]` don’t have an effect generally speaking. It’s just that the default codegen for the code snippet that I gave is already one that makes the first branch ‘likely’. If you negated the condition and swapped the two branches around, marking the `send()` branch as `[[likely]]` would change the generated assembly.

[^2]: CppCon 2017: Carl Cook, “When a Microsecond Is an Eternity: High Performance Trading Systems in C++”.