Auto-vectorizing operations on buffers of unknown length

GCC isn’t yet capable of auto-vectorizing even simple search operations on buffers for which the length isn’t explicitly specified¹.

Take the classic example of strlen():

1
2
3
4
5
6
7


size_t strlen(const char *str) {
    for (size_t i = 0;; i++) {
        if (str[i] == '\0') {
            return i;
        }
    }
}

This is the assembly output, compiled with -O3 -march=znver3 -ffreestanding²:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


strlen:
        xor     eax, eax
        cmp     BYTE PTR [rdi], 0
        je      .L1
.L2:
        inc     rax
        cmp     BYTE PTR [rdi+rax], 0
        jne     .L2
.L1:
        ret

It processes one byte at a time.

However, there is a trick that you can use to get this auto-vectorized.

New auto-vectorization capabilities in GCC 15

Loops with known lengths have gotten some significant auto-vectorization improvements in GCC 15.1, so we will try to leverage them in loops with unknown lengths like strlen().

For the moment, let’s change the problem of finding the null terminator’s position in a null-terminated string, to finding the first position of an arbitrary character in a char buffer of known length.

We can implement this as follows:

1
2
3
4
5
6
7
8


size_t find(const char *haystack, size_t len, char needle) {
    for (size_t i = 0; i < len; i++) {
        if (haystack[i] == needle) {
            return i;
        }
    }
    return SIZE_MAX; // Not found
}

Clang and GCC versions prior to GCC 15.1 fail to auto-vectorize this.

On the other hand, GCC 15.1’s assembly is:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89


find:
        mov     ecx, edx
        test    rsi, rsi
        je      .L11
        mov     rax, rdi
        mov     r8d, 47
        lea     r9, [rsi-1]
        neg     rax
        and     eax, 31
        mov     rdx, rax
        lea     rax, [rax+32]
        cmp     rax, r8
        cmovb   rax, r8
        cmp     r9, rax
        jb      .L12
        test    rdx, rdx
        je      .L13
        xor     eax, eax
        jmp     .L6
.L21:
        inc     rax
        cmp     rdx, rax
        je      .L20
.L6:
        cmp     cl, BYTE PTR [rdi+rax]
        jne     .L21
        ret
.L11:
        mov     rax, -1
        ret
.L20:
        vmovq   xmm0, rdx
.L4:
        mov     r9, rsi
        vmovd   xmm4, ecx
        lea     rax, [rdi+rdx]
        vpbroadcastq    ymm2, xmm0
        sub     r9, rdx
        vpaddq  ymm2, ymm2, YMMWORD PTR .LC0[rip]
        vpbroadcastq    ymm3, QWORD PTR .LC2[rip]
        vpbroadcastb    ymm4, xmm4
        mov     r8, r9
        and     r8, -32
        lea     rdx, [rax+r8]
        jmp     .L9
.L7:
        add     rax, 32
        vpaddq  ymm2, ymm2, ymm3
        cmp     rax, rdx
        je      .L22
.L9:
        vpcmpeqb        ymm1, ymm4, YMMWORD PTR [rax]
        vptest  ymm1, ymm1
        je      .L7
        vmovq   rax, xmm2
        vzeroupper
        jmp     .L10
.L23:
        inc     rax
        cmp     rax, rsi
        jnb     .L11
.L10:
        cmp     cl, BYTE PTR [rdi+rax]
        jne     .L23
        ret
.L22:
        cmp     r9, r8
        je      .L14
        vmovq   rax, xmm0
        add     rax, r8
        vzeroupper
        jmp     .L10
.L12:
        xor     eax, eax
        jmp     .L10
.L13:
        vpxor   xmm0, xmm0, xmm0
        jmp     .L4
.L14:
        mov     rax, -1
        vzeroupper
        ret
.LC0:
        .quad   0
        .quad   1
        .quad   2
        .quad   3
.LC2:
        .quad   32

Nicely auto-vectorized. GCC’s overall approach here is to go one byte at a time until it reaches an address aligned to 32 bytes (such that the YMMWORD loads are aligned), and then it goes one YMMWORD at a time until either running out of YMMWORDs, or finding a YMMWORD that contains the needle in one of its 32 elements.

So can we use this auto-vectorized find() to auto-vectorize our strlen() implementation? Let’s think about what arguments we need to give it:

The haystack argument obviously needs to be our null-terminated string;
The needle argument needs to be '\0';
Now for the important part: what should len be? Since we know that it would be undefined behavior to call strlen() with a string that doesn’t have a null terminator, we need to pass a length that almost always makes i < len evaluate to true (basically we want to make the condition redundant). Keyword: almost. If it always evaluates to true, then the loop would no longer qualify as having a ‘known length’, because it would be optimized to:
```
for (size_t i = 0; true; i++) {...}
```
The correct choice here is any length that in practice will be larger than the length of any string that the user is able to provide. Thus, we can pass SIZE_MAX³.

Let’s try it out.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


__always_inline
size_t find(const char *haystack, size_t len, char needle) {
    for (size_t i = 0; i < len; i++) {
        if (haystack[i] == needle) {
            return i;
        }
    }
    return SIZE_MAX; // Not found
}

size_t strlen(const char *str) {
    return find(str, SIZE_MAX, '\0');
}

Assembly:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83


strlen:
        mov     rdx, rdi
        push    rbp
        neg     rdx
        mov     rbp, rsp
        and     rsp, -32
        and     edx, 31
        je      .L10
        xor     eax, eax
.L4:
        cmp     BYTE PTR [rdi+rax], 0
        je      .L1
        inc     rax
        cmp     rdx, rax
        jne     .L4
        mov     rcx, rdx
        mov     QWORD PTR [rsp-8], rdx
        not     rcx
.L2:
        vpbroadcastq    ymm1, QWORD PTR [rsp-8]
        vmovq   xmm6, rcx
        vpbroadcastq    ymm4, QWORD PTR .LC4[rip]
        lea     rax, [rdi+rdx]
        vpbroadcastq    ymm2, xmm6
        vpaddq  ymm1, ymm1, YMMWORD PTR .LC0[rip]
        vpaddq  ymm2, ymm2, YMMWORD PTR .LC1[rip]
        lea     rsi, [rdi-32+rdx]
        vpbroadcastq    ymm3, QWORD PTR .LC5[rip]
        vpxor   xmm5, xmm5, xmm5
.L7:
        vpcmpeqb        ymm0, ymm5, YMMWORD PTR [rax]
        vptest  ymm0, ymm0
        jne     .L18
        add     rax, 32
        vpaddq  ymm1, ymm1, ymm4
        vpaddq  ymm2, ymm2, ymm3
        cmp     rax, rsi
        jne     .L7
        cmp     rdx, 31
        je      .L11
        mov     rax, QWORD PTR [rsp-8]
        lea     rdx, [rcx+32]
        sub     rax, 32
.L6:
        add     rdx, rax
.L9:
        cmp     BYTE PTR [rdi+rax], 0
        je      .L15
        inc     rax
        cmp     rdx, rax
        jne     .L9
.L11:
        mov     rax, -1
        vzeroupper
.L1:
        leave
        ret
.L15:
        vzeroupper
        leave
        ret
.L18:
        vmovq   rdx, xmm2
        vmovq   rax, xmm1
        jmp     .L6
.L10:
        mov     QWORD PTR [rsp-8], 0
        mov     rcx, -1
        jmp     .L2
.LC0:
        .quad   0
        .quad   1
        .quad   2
        .quad   3
.LC1:
        .quad   0
        .quad   -1
        .quad   -2
        .quad   -3
.LC4:
        .quad   32
.LC5:
        .quad   -32

Perfect, it works.

This works with more complex operations too, for example a loop that traverses a buffer and xor’s every element with the argument a, until reaching an element for which data[i] ^ a == b (this condition is considered to be the guaranteed ’null terminator’ in our buffer). See the Godbolt link at the bottom for this (and also the other functions).

I did a benchmark on an operation similar to strlen()⁴ and the speed-up was between 10x and 15x on my machine for buffer lengths around 2^32. Pretty good for some code that doesn’t require you to write inline assembly or intrinsics.

What’s next for GCC?

Two days before I reported this interesting behavior to the GCC team to understand why the ‘fake length’ optimization works, a patch set containing auto-vectorization support for loops WITHOUT known lengths had already been submitted. So I happened to find this optimization while they were already working on making it not needed. For example, after the patch set is merged upstream, the completely naive strlen() implementation should be fully auto-vectorized:

1
2
3
4
5
6
7


size_t strlen(const char *str) {
    for (size_t i = 0;; i++) {
        if (str[i] == '\0') {
            return i;
        }
    }
}

The new patch set should also make the auto-vectorization a bit shorter, since the ‘fake length’ trick adds some overhead because the function needs to do those i < len comparisons, even though in practice they will always be redundant and evaluate to true (or cause a segfault if a bad argument is passed). I don’t know the GCC version in which the patch set will be included, though.

Thanks for reading!

See the sources at: https://godbolt.org/z/Yc1vMbqqP

You can also check out the discussion of this post on:

r/cpp

Clang isn’t capable of auto-vectorizing any of the functions discussed in this blog post. ↩︎
GCC and Clang are capable of recognizing code patterns equivalent to strlen(). When doing so, they often choose to call the implementation provided by your C runtime, and this implementation can be manually vectorized depending on what flavor of libc you’re using. Whether or not GCC and Clang are able to recognize code patterns equivalent to strlen() is not of interest in this blog post. We only care whether GCC/Clang themselves are able to auto-vectorize such code patterns, and for this we use -ffreestanding to tell the compiler to assume that there is no C runtime available. ↩︎
If the condition in the loop was i <= len instead of i < len, then SIZE_MAX would cause i <= len to always evaluate to true, and the auto-vectorization would fail if inlining is enabled. ↩︎
See the attachment in my GCC report. This is a lazy benchmark since it doesn’t do fair cache warming or more intricate things like libbenchmark for example, but it tries to give the byte-by-byte implementation a head start by making it run after the auto-vectorized version has already warmed the cache. So the benchmark just gives a rough idea about the possible speed-up. ↩︎

←HomeAbout→ Auto-vectorizing operations on buffers of unknown length

New auto-vectorization capabilities in GCC 15

What’s next for GCC?