Abstract:
Unrolling the loop in strlen function.

Created by Peter Kankowski
Last changed
Contributors: Ace, Vladimir Kraljevic, and Madis Kalme
Filed under Low-level code optimization

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Fast strlen function

Disclaimer: The strlen function rarely lies on the critical path of the program, and if it is, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.

Vectorized strlen

A usual implementation of strlen function scans the string byte-by-byte looking for terminating zero. For example:

size_t strlen(const char *s) {
    const char *start = s;
    while(*s)
        s++;
    return s - start;
}

It's small and easy, but not very fast. That is how we can improve it:

// for x86 only
size_t my_strlen(const char *s) {
    size_t len = 0;
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero. That is what Agner Fog calls "vector operations in general purpose registers".

Problems and solutions

Warning: this function will crash if an non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of string.

The dwords may be unaligned, but x86 architecture allows access to unaligned data. For small strings, the alignment will take more time than the penalty of unaligned reads.

The code is not portable: you will have to add another 4 conditions if you use a 64-bit processor. For big-endian architectures, the order of conditions should be reversed.

Agner Fog made another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.

Test conditions and results

The functions were tested on several short strings (words in Gettysburg address) and on a long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for measurements.

FunctionShort stringsThe long string
strlen by Microsoft14712104
strlen_my10191471
strlen by Agner Fog12791056

Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.

Download the test program (6 Kb)

Related articles

SSE2 optimised strlen by Dmitry Kostjuchenko

Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski

Peter Kankowski
Peter Kankowski

About the author

Peter lives in Siberia, the land of sleeping sun, beautiful mountains, and infinitely deep snow. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP (on Windows platform). He can be reached at kankowski@narod.ru.

56 comments

Ten recent comments are shown below. Show all comments

mischa sandberg,

You don't need sse4 to get this speed-up; pcmpeq + movemask + bfs do the job quite well.

Here's an sse2 equivalent, with attention to using aligned loads. It's gcc 4.4.

Pardon my dense style and presumed #includes

#define xmload(p)   _mm_load_si128((__m128i const*)p)
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))

char const *sse2_strchr(char const *str, char ch)
{
    __m128i     x, zero = {}, p0 = _mm_set1_epi8(ch);
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of tgt:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = (xmatch(x,p0) >> f) & ~z & (z - 1)
        if (m) return tgt + ffs(m) - 1;
        if (z) return NULL;
        tgt += 16 - f;
    }

    // 16-byte aligned chunks of str:
    for (; !(z = xmatch(x = xmload(str), zero)); str += 16)
        if ((m = xmatch(x,p0)))
            return str + ffs(m) - 1;

    // Final 0..15 bytes of tgt:
    if ((m = xmatch(x,p0) & ~z & (z - 1)))
        return str + ffs(m) - 1;

    return NULL;
}

Peter Kankowski,

Thank you. Agner Fog uses the same approach for strlen in his asmlib. Unfortunately, even after small corrections your code fails the tests.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <emmintrin.h>

#define xmload(p)   _mm_load_si128((__m128i const*)p)
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
#ifdef __GNUC__
#define ctz(a) __builtin_ctz(a)
#elif defined(_MSC_VER)
#include <intrin.h>
inline unsigned long ctz(unsigned long x) {
	unsigned long index;
	_BitScanForward(&index, x);
	return index;
}
#endif

// by mischa sandberg
char const *sse2_strchr(char const *str, char ch)
{
    __m128i     x, zero = {}, p0 = _mm_set1_epi8(ch);
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of tgt:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = (xmatch(x,p0) >> f) & ~z & (z - 1);
        if (m) return str + ctz(m);
        if (z) return NULL;
        str += 16 - f;
    }

    // 16-byte aligned chunks of str:
    for (; !(z = xmatch(x = xmload(str), zero)); str += 16)
        if ((m = xmatch(x,p0)))
            return str + ctz(m);

    // Final 0..15 bytes of tgt:
    if ((m = xmatch(x,p0) & ~z & (z - 1)))
        return str + ctz(m);

    return NULL;
}


int main() {
    assert( strcmp( sse2_strchr("abc", 'a'), "abc" ) ==0 );
    assert( NULL == sse2_strchr("", 'a') );
    assert( strcmp( sse2_strchr("abc", 'b'), "bc" ) ==0 );
    assert( NULL == sse2_strchr("abc", 'z') );
    
    assert( "st" == sse2_strchr("abc def ghi jkl mno pqr st", 's') );
    assert( "mno pqr st" == sse2_strchr("abc def ghi jkl mno pqr st", 'm') );
    return 0;
}
(crashes on the first xmload with access violation; tested under MSVC++ and GCC)
mischa sandberg,

Gah! Sorry. To save space I compressed xmload/xmatch from static inline functions into #define's.

The fix is:

#define xmload(p)   _mm_load_si128((__m128i const*)(p))

mischa sandberg,

Also, I'm fascinated that

   assert("st" == sse2_strchr(...)) 

works, as opposed to

   assert(!strcmp("st", sse2_strchr(...)))

With the compile switches I use, that doesn't get past a compile error:

"comparison with string literal results in unspecified behavior"

mischa sandberg,

The above logic can be extended to a strstr that scans for a match on the first two chars of the pattern.

Sorry for not working out the MSVC details here.

$ cat toad.c
#include <string.h>
#include <stdint.h>
#include <emmintrin.h>
static inline int under(int x) { return ~x & (x - 1); }
#define ctz(m)      (ffs(m) - 1)
#define xmload(p)   _mm_load_si128((__m128i const*)(p))
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
static char const *scanchr2(char const *str, char const pat[2])
{
    __m128i     x, zero = {};
    __m128i     p0 = _mm_set1_epi8(pat[0]), p1 = _mm_set1_epi8(pat[1]);
    uint16_t    pair = *(uint16_t const*)pat;
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of str:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = under(z) & ((xmatch(x,p0) & (xmatch(x,p1) >> 1)) >> f);
        if (m) return str + ctz(m);
        if (z) return NULL;
        str += 16 - f;
        if (*(uint16_t const*)(str - 1) == pair)
            return str - 1;
    }

    // 16-byte aligned chunks of str:
    while (!(z = xmatch(x = xmload(str), zero))) {
        m = xmatch(x,p0) & (xmatch(x,p1) >> 1);
        if (m) return str + ctz(m);
        str += 16;
        if (*(uint16_t const*)(str - 1) == pair)
            return str - 1;
    }

    // Final 0..15 bytes of str:
    m = under(z) & xmatch(x,p0) & (xmatch(x,p1) >> 1);
    return m ? str + ctz(m) : NULL;
}

static char const *sse2_strstr(char const *str, char const *pat)
{
    unsigned patlen = strlen(pat);
    if (patlen == 0) return str;
    if (patlen == 1) return strchr(str, *pat);
    while ((str = scanchr2(str, pat)))
        if (!memcmp(str+2, pat+2, patlen-2))
            return str;
    return NULL;
}

int main(void)
{
    return strcmp("world", sse2_strstr("hello, world", "world"));
}

$ gmake toad
cc -g -MMD -O9 -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Winline -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -O0 -msse3 -D_GNU_SOURCE -I/usr/local/include  -c -o toad.o toad.c
cc -L/usr/local/lib  toad.o  -ldl -lresolv -o toad
$ ./toad
$ echo $?
0

mischa sandberg,

The above runs at about 10-15 times the speed of the glibc 4.4.5 strstr for any pattern less than 32 bytes.

This includes edge cases like the match occurring at offset 0 of the target string.

Note that glibc strlen (and hence glibc strstr) benefits from the same strlen that in fact uses the basic above idea.

There's a speed drop to 2-3 times the speed of the glibc strstr at 32 bytes ... still chewing on why.

mischasan.wordpress.com for the things I've been finding.

mischa sandberg,

And, umm, scratch the -O0 in the above "cc" line if you want to see the 10-15 times speedup :-)

Peter Kankowski,
Also, I'm fascinated that assert("st" == sse2_strchr(...)) works

I'm sorry, it certainly should be strcmp. The two-char strstr is a great idea. I've subscribed to your blog.

Misaligned memory accesses on x86,

On x86 and x86_64 (on core I7), there is no penalty for any misaligned memory access within a cache line, and for misaligned accesses crossing a cache line, the penalty is one cycle. You might as well consider misaligned accesses completely free. AMD architectures also have negligible penalties for misalignment.

I think I know why: I was doing performance analysis once and I enabled alignment checking in the eflags register and setup my debugger to catch the alignment traps and record the stack traces. To my surprise, I was getting millions of misaligned accesses throughout windows code and in the CRT. My theory is, Intel engineers did a similar investigation and found misaligned accesses to be happening so often, that they realized that they needed to make it super fast. In Core I7, the handling of misaligned accesses is absolutely amazing.

ace,

The first Intel processors x86 with no penalty for some misaligned memory access that I know of were some Core 2 models on 45 nm. The ones on 65 nm were significantly slower.

Your name:
Comment:

Please ignore this field: