Problem
Consider that you have some mapping from indexes in the range [0, 255]
to indexes in the range
[0, 1023]
. The requirements are:
- Write a function that takes an array of 1024 8-bit unsigned integers and increments the value at an arbitrary index.
- Only use safe methods of indexing that do bounds checking (e.g. using
.at()
in C++ instead of the subscript operator). - Avoid constructs such as
__builtin_unreachable()
or__builtin_assume()
. - Arrive at an implementation where, even though you’re using checked indexing methods, those checks are omitted from the generated assembly.
Finding a solution
Based on this description, my first instinct would be to write a constexpr
table defining the
index mapping:
|
|
And then use the table like this to increment the value at a given position:
|
|
When you look at the generated assembly you’ll see that GCC and Clang are smart enough to realize
that doing bound checks on TABLE.at(first_idx)
is redundant, because TABLE
is of length 256
and the range of values for a uint8_t
index is [0, 255]
; therefore, the access is always going
to be within bounds. Great!
However, GCC and Clang are not smart enough (yet) to realize that doing bound checks on
data.at(second_idx)
is also redundant because, as per the problem statement, TABLE
contains
values strictly in the [0, 1023]
range and data
has length equal to 1024
. So the access at
second_idx
is always going to be valid in this case too.
Clang’s output goes like this, and GCC’s is similar:
|
|
That extra branch is probably not the end of the world, but we can try to do better.
So… what’s happening here? Well, it seems that Clang and GCC don’t have enough insight regarding
the range of values in TABLE
in order to make the inference that checking bounds on
data.at(second_idx)
is redundant. I submitted this missed optimization opportunity to both GCC1
and Clang/LLVM2, so hopefully they get this implemented.
In the meantime, we can get around this issue by exploring other methods of representing the index
mapping (i.e. not as a constexpr
table).
From Andrew Pinski’s comment in the report on bugzilla:
Maybe the best way to solve this is have an IPA pass over analyzes the static variables for their ranges and store that range off and then when VRP comes a long and reads the range and records it. In a similar sense as we handle return value ranges. Note there might be other bug reports for a similar thing …
I understood that GCC does have a mechanism which applies this type of optimization, called Value Range Propagation (VRP), but it’s either not implemented at all for table lookups, or its capabilities are more primitive compared to the ‘return value range propagation’ that Andrew Pinski mentioned.
So, we can then try representing the index mapping as a function which uses a big switch statement
with a case for every index in the range [0, 255]
, which returns its appropriate mapped value:
|
|
Now, if we adapt increment()
to use this function instead, like so:
|
|
Unfortunately, GCC still doesn’t apply the optimization, but Clang does:
|
|
Finally, it sees that the bounds check is redundant, and generates branchless code. So representing the mapping as a function was enough to help VRP see through the range of possible index values, and then optimize the code better.
Job done! You can see the full code here, or play around with it on Godbolt.