Eliminating redundant bound checks

Problem

Consider that you have some mapping from indexes in the range [0, 255] to indexes in the range [0, 1023]. The requirements are:

  1. Write a function that takes an array of 1024 8-bit unsigned integers and increments the value at an arbitrary index.
  2. Only use safe methods of indexing that do bounds checking (e.g. using .at() in C++ instead of the subscript operator).
  3. Avoid constructs such as __builtin_unreachable() or __builtin_assume().
  4. Arrive at an implementation where, even though you’re using checked indexing methods, those checks are omitted from the generated assembly.

Finding a solution

Based on this description, my first instinct would be to write a constexpr table defining the index mapping:

// Pre-defined index mapping:
static constexpr std::array<size_t, 256> TABLE = {
    798, 553, 541, 345, 276, 698, 861, 448, 898, 588, 678, 593, 265, 611, 915,
    835, 893, 3,   411, 769, 792, 115, 526, 836, 356, 454, 279, 876, 924, 644,
//  ............................................................................
};

And then use the table like this to increment the value at a given position:

void increment(std::array<uint8_t, 1024> &data, uint8_t first_idx)
{
    size_t second_idx = TABLE.at(first_idx);
    data.at(second_idx)++;
}

When you look at the generated assembly you’ll see that GCC and Clang are smart enough to realize that the bounds check on TABLE.at(first_idx) is redundant: TABLE has length 256 and a uint8_t index can only take values in [0, 255], so the access is always going to be within bounds. Great!

However, GCC and Clang are not smart enough (yet) to realize that the bounds check on data.at(second_idx) is also redundant: as per the problem statement, TABLE contains values strictly in the range [0, 1023] and data has length 1024, so the access at second_idx is always going to be valid too.
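Even though the optimizer can’t infer this invariant on its own, we can at least state and enforce it ourselves at compile time. A minimal sketch, using an abbreviated 4-entry stand-in for the real 256-entry table and a hypothetical helper all_in_range (not part of the original code):

```cpp
#include <array>
#include <cstddef>

// Abbreviated 4-entry stand-in for the real 256-entry TABLE (illustration only).
static constexpr std::array<std::size_t, 4> TABLE = {798, 553, 541, 345};

// all_in_range is a hypothetical helper: it states, at compile time, the
// invariant the optimizer fails to infer -- every entry is in [0, 1023].
constexpr bool all_in_range(const std::array<std::size_t, 4> &t)
{
    for (std::size_t v : t)
        if (v >= 1024)
            return false;
    return true;
}

static_assert(all_in_range(TABLE), "TABLE entry out of range");
```

This doesn’t remove the runtime check, but it does turn a comment-level assumption into something the build will verify.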

Clang’s output goes like this, and GCC’s is similar:

increment(std::array<unsigned char, 1024ul>&, unsigned char):
        mov     eax, esi
        lea     rcx, [rip + TABLE]
        mov     rsi, qword ptr [rcx + 8*rax]
        cmp     rsi, 1024 # REDUNDANT CHECK!
        jae     .LBB0_2
        inc     byte ptr [rdi + rsi]
        ret
.LBB0_2: # REDUNDANT BRANCH!
        push    rax
        lea     rdi, [rip + .L.str]
        mov     edx, 1024
        xor     eax, eax
        call    std::__throw_out_of_range_fmt(char const*, ...)@PLT

TABLE:
        .quad   798
        .quad   553
#       ...........

That extra branch is probably not the end of the world, but we can try to do better.

So… what’s happening here? Well, it seems that Clang and GCC don’t have enough insight into the range of values stored in TABLE to infer that the bounds check on data.at(second_idx) is redundant. I submitted this missed optimization opportunity to both GCC [1] and Clang/LLVM [2], so hopefully it gets implemented.

In the meantime, we can get around this issue by exploring other methods of representing the index mapping (i.e. not as a constexpr table).

From Andrew Pinski’s comment in the report on bugzilla:

Maybe the best way to solve this is to have an IPA pass that analyzes the static variables for their ranges and stores that range off, and then when VRP comes along it reads the range and records it. In a similar sense as we handle return value ranges. Note there might be other bug reports for a similar thing …

I understood that GCC does have a mechanism that applies this type of optimization, called Value Range Propagation (VRP), but it’s either not implemented at all for table lookups, or its capabilities are more primitive than the ‘return value range propagation’ that Andrew Pinski mentioned.
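The ‘return value range’ propagation Pinski refers to is easy to see in a toy example: when a function’s return value is visibly constrained, here by a mask, the compiler can propagate that range to the call site and drop the check after inlining. The names clamped_idx and increment_clamped are hypothetical, and the mask deliberately sidesteps the TABLE mapping; this only illustrates the mechanism:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// clamped_idx's return value is visibly confined to [0, 1023] by the mask,
// so the compiler can prove the .at() below never throws and elide the check.
static std::size_t clamped_idx(std::size_t raw) { return raw & 1023; }

// Hypothetical variant (not the TABLE mapping from the problem statement):
// it only shows how a function's return range feeds the optimizer.
void increment_clamped(std::array<uint8_t, 1024> &data, std::size_t raw)
{
    data.at(clamped_idx(raw))++;
}
```

Of course, masking changes the mapping, so it isn’t a valid solution to the original problem; it just shows the kind of range reasoning we’d like the compiler to apply to TABLE.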

So, we can instead try representing the index mapping as a function: a big switch statement with a case for every index in the range [0, 255], each returning the corresponding mapped value:

static size_t first_idx_to_second_idx(uint8_t first_idx)
{
    // Pre-defined index mapping:
    switch (first_idx) {
    case 0: return 798; case 1: return 553; case 2: return 541; case 3: return 345;
    case 4: return 276; case 5: return 698; case 6: return 861; case 7: return 448;
//  ...............................................................................
    }
}

Now, if we adapt increment() to use this function instead, like so:

void increment(std::array<uint8_t, 1024> &data, uint8_t first_idx)
{
    size_t second_idx = first_idx_to_second_idx(first_idx); // Use this instead of TABLE
    data.at(second_idx)++;
}

Unfortunately, GCC still doesn’t apply the optimization, but Clang does:

increment(std::array<unsigned char, 1024ul>&, unsigned char):
        add     sil, -128
        movzx   eax, sil
        lea     rcx, [rip + .Lswitch.table.increment(std::array<unsigned char, 1024ul>&, unsigned char)]
        mov     rax, qword ptr [rcx + 8*rax]
        inc     byte ptr [rdi + rax]
        ret

.Lswitch.table.increment(std::array<unsigned char, 1024ul>&, unsigned char):
        .quad   782
        .quad   582
#       ...........

Finally, it sees that the bounds check is redundant and generates branchless code. So representing the mapping as a function was enough to help the compiler’s value range analysis see through the range of possible index values and optimize the code better.

Job done! You can see the full code here, or play around with it on Godbolt.
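As a final sanity check, here is a small self-contained harness for the switch-based version. The mapping is abbreviated to the first few entries from the article, with a default case as a stand-in for the remaining ones (note the stand-in value is also inside [0, 1023], so the compiler can still bound the function’s result):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Abbreviated mapping: first entries from the article; the default case is a
// stand-in for the remaining ones (still inside [0, 1023]).
static size_t first_idx_to_second_idx(uint8_t first_idx)
{
    switch (first_idx) {
    case 0: return 798;
    case 1: return 553;
    case 2: return 541;
    default: return 0;
    }
}

void increment(std::array<uint8_t, 1024> &data, uint8_t first_idx)
{
    data.at(first_idx_to_second_idx(first_idx))++;
}
```

Calling increment(data, 0) once and increment(data, 1) twice should leave data[798] == 1 and data[553] == 2, matching the table-based version’s behavior.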