Disclaimer: The strlen function rarely lies on the critical path of the program, and if it is, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.
Vectorized strlen
A usual implementation of strlen function scans the string byte-by-byte looking for terminating zero. For example:
size_t strlen(const char *s) {
const char *start = s;
while(*s)
s++;
return s - start;
}
It's small and easy, but not very fast. That is how we can improve it:
// for x86 only
size_t my_strlen(const char *s) {
size_t len = 0;
for(;;) {
unsigned x = *(unsigned*)s;
if((x & 0xFF) == 0) return len;
if((x & 0xFF00) == 0) return len + 1;
if((x & 0xFF0000) == 0) return len + 2;
if((x & 0xFF000000) == 0) return len + 3;
s += 4, len += 4;
}
}
Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero. That is what Agner Fog calls "vector operations in general purpose registers".
Problems and solutions
Warning: this function will crash if an non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of string.
The dwords may be unaligned, but x86 architecture allows access to unaligned data. For small strings, the alignment will take more time than the penalty of unaligned reads.
The code is not portable: you will have to add another 4 conditions if you use a 64-bit processor. For big-endian architectures, the order of conditions should be reversed.
Agner Fog made another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.
Test conditions and results
The functions were tested on several short strings (words in Gettysburg address) and on a long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for measurements.
| Function | Short strings | The long string |
|---|---|---|
| strlen by Microsoft | 1471 | 2104 |
| strlen_my | 1019 | 1471 |
| strlen by Agner Fog | 1279 | 1056 |
Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.
Download the test program (6 Kb)
Related articles
SSE2 optimised strlen by Dmitry Kostjuchenko
Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski

56 comments
Ten recent comments are shown below. Show all comments
You don't need sse4 to get this speed-up; pcmpeq + movemask + bfs do the job quite well.
Here's an sse2 equivalent, with attention to using aligned loads. It's gcc 4.4.
Pardon my dense style and presumed #includes
#define xmload(p) _mm_load_si128((__m128i const*)p) #define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) char const *sse2_strchr(char const *str, char ch) { __m128i x, zero = {}, p0 = _mm_set1_epi8(ch); unsigned z, m, f = 15 & (uintptr_t)str; // Initial unaligned chunk of tgt: if (f) { z = xmatch(x = xmload(str - f), zero) >> f; m = (xmatch(x,p0) >> f) & ~z & (z - 1) if (m) return tgt + ffs(m) - 1; if (z) return NULL; tgt += 16 - f; } // 16-byte aligned chunks of str: for (; !(z = xmatch(x = xmload(str), zero)); str += 16) if ((m = xmatch(x,p0))) return str + ffs(m) - 1; // Final 0..15 bytes of tgt: if ((m = xmatch(x,p0) & ~z & (z - 1))) return str + ffs(m) - 1; return NULL; }Thank you. Agner Fog uses the same approach for strlen in his asmlib. Unfortunately, even after small corrections your code fails the tests.
#include <stdio.h> #include <stdint.h> #include <string.h> #include <assert.h> #include <emmintrin.h> #define xmload(p) _mm_load_si128((__m128i const*)p) #define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) #ifdef __GNUC__ #define ctz(a) __builtin_ctz(a) #elif defined(_MSC_VER) #include <intrin.h> inline unsigned long ctz(unsigned long x) { unsigned long index; _BitScanForward(&index, x); return index; } #endif // by mischa sandberg char const *sse2_strchr(char const *str, char ch) { __m128i x, zero = {}, p0 = _mm_set1_epi8(ch); unsigned z, m, f = 15 & (uintptr_t)str; // Initial unaligned chunk of tgt: if (f) { z = xmatch(x = xmload(str - f), zero) >> f; m = (xmatch(x,p0) >> f) & ~z & (z - 1); if (m) return str + ctz(m); if (z) return NULL; str += 16 - f; } // 16-byte aligned chunks of str: for (; !(z = xmatch(x = xmload(str), zero)); str += 16) if ((m = xmatch(x,p0))) return str + ctz(m); // Final 0..15 bytes of tgt: if ((m = xmatch(x,p0) & ~z & (z - 1))) return str + ctz(m); return NULL; } int main() { assert( strcmp( sse2_strchr("abc", 'a'), "abc" ) ==0 ); assert( NULL == sse2_strchr("", 'a') ); assert( strcmp( sse2_strchr("abc", 'b'), "bc" ) ==0 ); assert( NULL == sse2_strchr("abc", 'z') ); assert( "st" == sse2_strchr("abc def ghi jkl mno pqr st", 's') ); assert( "mno pqr st" == sse2_strchr("abc def ghi jkl mno pqr st", 'm') ); return 0; }(crashes on the first xmload with access violation; tested under MSVC++ and GCC)Gah! Sorry. To save space I compressed xmload/xmatch from static inline functions into #define's.
The fix is:
Also, I'm fascinated that
assert("st" == sse2_strchr(...))works, as opposed to
assert(!strcmp("st", sse2_strchr(...)))With the compile switches I use, that doesn't get past a compile error:
"comparison with string literal results in unspecified behavior"
The above logic can be extended to a strstr that scans for a match on the first two chars of the pattern.
Sorry for not working out the MSVC details here.
$ cat toad.c #include <string.h> #include <stdint.h> #include <emmintrin.h> static inline int under(int x) { return ~x & (x - 1); } #define ctz(m) (ffs(m) - 1) #define xmload(p) _mm_load_si128((__m128i const*)(p)) #define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) static char const *scanchr2(char const *str, char const pat[2]) { __m128i x, zero = {}; __m128i p0 = _mm_set1_epi8(pat[0]), p1 = _mm_set1_epi8(pat[1]); uint16_t pair = *(uint16_t const*)pat; unsigned z, m, f = 15 & (uintptr_t)str; // Initial unaligned chunk of str: if (f) { z = xmatch(x = xmload(str - f), zero) >> f; m = under(z) & ((xmatch(x,p0) & (xmatch(x,p1) >> 1)) >> f); if (m) return str + ctz(m); if (z) return NULL; str += 16 - f; if (*(uint16_t const*)(str - 1) == pair) return str - 1; } // 16-byte aligned chunks of str: while (!(z = xmatch(x = xmload(str), zero))) { m = xmatch(x,p0) & (xmatch(x,p1) >> 1); if (m) return str + ctz(m); str += 16; if (*(uint16_t const*)(str - 1) == pair) return str - 1; } // Final 0..15 bytes of str: m = under(z) & xmatch(x,p0) & (xmatch(x,p1) >> 1); return m ? str + ctz(m) : NULL; } static char const *sse2_strstr(char const *str, char const *pat) { unsigned patlen = strlen(pat); if (patlen == 0) return str; if (patlen == 1) return strchr(str, *pat); while ((str = scanchr2(str, pat))) if (!memcmp(str+2, pat+2, patlen-2)) return str; return NULL; } int main(void) { return strcmp("world", sse2_strstr("hello, world", "world")); } $ gmake toad cc -g -MMD -O9 -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Winline -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -O0 -msse3 -D_GNU_SOURCE -I/usr/local/include -c -o toad.o toad.c cc -L/usr/local/lib toad.o -ldl -lresolv -o toad $ ./toad $ echo $? 0The above runs at about 10-15 times the speed of the glibc 4.4.5 strstr for any pattern less than 32 bytes.
This includes edge cases like the match occurring at offset 0 of the target string.
Note that glibc strlen (and hence glibc strstr) benefits from the same strlen that in fact uses the basic above idea.
There's a speed drop to 2-3 times the speed of the glibc strstr at 32 bytes ... still chewing on why.
mischasan.wordpress.com for the things I've been finding.
And, umm, scratch the -O0 in the above "cc" line if you want to see the 10-15 times speedup :-)
I'm sorry, it certainly should be strcmp. The two-char strstr is a great idea. I've subscribed to your blog.
On x86 and x86_64 (on core I7), there is no penalty for any misaligned memory access within a cache line, and for misaligned accesses crossing a cache line, the penalty is one cycle. You might as well consider misaligned accesses completely free. AMD architectures also have negligible penalties for misalignment.
I think I know why: I was doing performance analysis once and I enabled alignment checking in the eflags register and setup my debugger to catch the alignment traps and record the stack traces. To my surprise, I was getting millions of misaligned accesses throughout windows code and in the CRT. My theory is, Intel engineers did a similar investigation and found misaligned accesses to be happening so often, that they realized that they needed to make it super fast. In Core I7, the handling of misaligned accesses is absolutely amazing.
The first Intel processors x86 with no penalty for some misaligned memory access that I know of were some Core 2 models on 45 nm. The ones on 65 nm were significantly slower.