GCC isn’t yet capable of auto-vectorizing even simple search operations on buffers for which the length isn’t explicitly specified1.
Take the classic example of strlen():
|
|
This is the assembly output, compiled with -O3 -march=znver3 -ffreestanding2:
|
|
It processes one byte at a time.
However, there is a trick that you can use to get this auto-vectorized.
New auto-vectorization capabilities in GCC 15
Loops with known lengths have gotten some significant auto-vectorization
improvements in GCC 15.1, so we will
try to leverage them in loops with unknown lengths like strlen().
For the moment, let’s change the problem of finding the null terminator’s position in a null-terminated string, to finding the first position of an arbitrary character in a char buffer of known length.
We can implement this as follows:
|
|
Clang and GCC versions prior to GCC 15.1 fail to auto-vectorize this.
On the other hand, GCC 15.1’s assembly is:
|
|
Nicely auto-vectorized. GCC’s overall approach here is to go one byte at a time
until it reaches an address aligned to 32 bytes (such that the YMMWORD loads
are aligned), and then it goes one YMMWORD at a time until either running out
of YMMWORDs, or finding a YMMWORD that contains the needle in one of its 32
elements.
So can we use this auto-vectorized find() to auto-vectorize our strlen()
implementation? Let’s think about what arguments we need to give it:
-
The
haystackargument obviously needs to be our null-terminated string; -
The
needleargument needs to be'\0'; -
Now for the important part: what should
lenbe? Since we know that it would be undefined behavior to callstrlen()with a string that doesn’t have a null terminator, we need to pass a length that almost always makesi < lenevaluate totrue(basically we want to make the condition redundant). Keyword: almost. If it always evaluates totrue, then the loop would no longer qualify as having a ‘known length’, because it would be optimized to:for (size_t i = 0; true; i++) {...}The correct choice here is any length that in practice will be larger than the length of any string that the user is able to provide. Thus, we can pass
SIZE_MAX3.
Let’s try it out.
|
|
Assembly:
|
|
Perfect, it works.
This works with more complex operations too, for example a loop that traverses
a buffer and xor’s every element with the argument a, until reaching an
element for which data[i] ^ a == b (this condition is considered to be the
guaranteed ’null terminator’ in our buffer). See the Godbolt link at the bottom
for this (and also the other functions).
I did a benchmark on an operation similar to strlen()4 and the speed-up
was between 10x and 15x on my machine for buffer lengths around 2^32. Pretty
good for some code that doesn’t require you to write inline assembly or
intrinsics.
What’s next for GCC?
Two days before I
reported this
interesting behavior to the GCC team to understand why the ‘fake length’
optimization works, a patch set containing auto-vectorization support for loops
WITHOUT known lengths had already been
submitted.
So I happened to find this optimization while they were already working on
making it not needed. For example, after the patch set is merged upstream, the
completely naive strlen() implementation should be fully auto-vectorized:
|
|
The new patch set should also make the auto-vectorization a bit shorter, since
the ‘fake length’ trick adds some overhead because the function needs to do
those i < len comparisons, even though in practice they will always be
redundant and evaluate to true (or cause a segfault if a bad argument is
passed). I don’t know the GCC version in which the patch set will be included,
though.
Thanks for reading!
See the sources at: https://godbolt.org/z/Yc1vMbqqP
You can also check out the discussion of this post on:
-
Clang isn’t capable of auto-vectorizing any of the functions discussed in this blog post. ↩︎
-
GCC and Clang are capable of recognizing code patterns equivalent to
strlen(). When doing so, they often choose to call the implementation provided by your C runtime, and this implementation can be manually vectorized depending on what flavor oflibcyou’re using. Whether or not GCC and Clang are able to recognize code patterns equivalent tostrlen()is not of interest in this blog post. We only care whether GCC/Clang themselves are able to auto-vectorize such code patterns, and for this we use-ffreestandingto tell the compiler to assume that there is no C runtime available. ↩︎ -
If the condition in the loop was
i <= leninstead ofi < len, thenSIZE_MAXwould causei <= lento always evaluate totrue, and the auto-vectorization would fail if inlining is enabled. ↩︎ -
See the attachment in my GCC report. This is a lazy benchmark since it doesn’t do fair cache warming or more intricate things like
libbenchmarkfor example, but it tries to give the byte-by-byte implementation a head start by making it run after the auto-vectorized version has already warmed the cache. So the benchmark just gives a rough idea about the possible speed-up. ↩︎