Abstract:
Unrolling the loop in strlen function.

Created 1 year ago by Peter Kankowski
Last changed 2 days ago
Contributors: Ace, Vladimir Kraljevic, and Madis Kalme
Filed under Low-level code optimization

Fast strlen function

Disclaimer: The strlen function rarely lies on the critical path of the program, and if it is, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.

Vectorized strlen

A usual implementation of strlen function scans the string byte-by-byte looking for terminating zero. For example:

size_t strlen(const char *s) {
    const char *start = s;
    while(*s)
        s++;
    return s - start;
}

It's small and easy, but not very fast. That is how we can improve it:

// for x86 only
size_t my_strlen(const char *s) {
    size_t len = 0;
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero. That is what Agner Fog calls "vector operations in general purpose registers".

Problems and solutions

Warning: this function will crash if an non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of string.

The dwords may be unaligned, but x86 architecture allows access to unaligned data. For small strings, the alignment will take more time than the penalty of unaligned reads.

The code is not portable: you will have to add another 4 conditions if you use a 64-bit processor. For big-endian architectures, the order of conditions should be reversed.

Agner Fog made another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.

Test conditions and results

The functions were tested on several short strings (words in Gettysburg address) and on a long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for measurements.

FunctionShort stringsThe long string
strlen by Microsoft14712104
strlen_my10191471
strlen by Agner Fog12791056

Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.

Download the test program (6 Kb)

40 comments

Ten recent comments are shown below. Show all comments

Brian Hurt, 1 year ago

You missed an obvious optimization. You can detect a null in a four-byte word like:

    unsigned y = x & 0x7F7F7F7Fu;
    y += 0x7F7F7F7F;
    y |= x;
    if ((y & 0x80808080) != 0x80808080u) {
        /* x has a nul- check each byte */
    } else {
        /* x does not have a null */
    }

The advantage this approach has is fewer jumps- and it uses only simple (1-clock) instructions, and 2 registers- allowing you to unroll this loop 2 or 3 times and give the execution units plenty to do.

Paolo Bonzini, 1 year ago

Regarding invalid accesses, one possibility is this (sorry for the goto, so much easier to write it this way):

len = 0;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
loop:
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }

By aligning the address at the beginning, you ensure the code will never perform memory accesses that span two pages.

Peter Kankowski, 1 year ago

Thank you, it's a nice idea! I would rewrite it with a fall-through switch (similar to Duff's device) to avoid the excess conditions:

int strlen_my(const char *s) {
    int len = 0;
    const char *p = s;
    switch ( (intptr_t)p & 3 ) {
        case 3:
            if (*p++ == 0)
               return p - s;
        case 2:
            if (*p++ == 0)
               return p - s;
        case 1:
            if (*p++ == 0)
               return p - s;
    }
 
    for(;;) {
        unsigned x = *(unsigned*)p;
        if((x & 0xFF) == 0) return p - s;
        if((x & 0xFF00) == 0) return p - s + 1;
        if((x & 0xFF0000) == 0) return p - s + 2;
        if((x & 0xFF000000) == 0) return p - s + 3;
        p += 4;
    }
}

The benchmark shows that this code is slower on short strings, but slightly faster on a long string because of aligned memory access. Thank you again for the idea.

Paolo Bonzini, 1 year ago

Can you find where is the break-even point WRT Agner's code now? Probably this is as good as you can be for reasonably sized strings; and safer than your original code which makes the comparison with your code apples-to-oranges, basically.

(Note that you are trading up to 3 conditions for always 2 conditions – or no change at all if the compiler decides it's not worth using a decision tree implementation of the switch statement – so not much is changing regarding the number of jumps).

k06a, 10 months ago
s+=4; => ((unsigned*)s)++;
// Increment is more usefull i think.
Dmitry Kostjuchenko, 9 months ago

I would propose aligning in while{} loop to avoid excessive conditions, this shall outperform fall-through switch as well, at least in my syntetic tests by +1.5%.

size_t __aux_strlen(const char *str)
{
    register size_t len = 0;
    // align pointer to 32 bits
    while ((((intptr_t)str) & (sizeof(unsigned int)-1)) != 0)
    {
        if (*str ++ == 0) 
            return len; 
        ++ len;
    }
    // search for 0
    for (;;)
    {
        unsigned int v = *(unsigned int *)str;
        if ((v & 0x000000FFU) == 0) return (len);
        if ((v & 0x0000FF00U) == 0) return (len + 1);
        if ((v & 0x00FF0000U) == 0) return (len + 2);
        if ((v & 0xFF000000U) == 0) return (len + 3);
        str += sizeof(unsigned int);
        len += sizeof(unsigned int);
    }
}
Peter Kankowski, 9 months ago

Thank you very much, Dmitry. Your function is also 29 bytes shorter when compiled with MSVC++ 2005.

Spanky, 7 months ago

"Four bytes are examined at once." Really? How odd. This _looks_ like it's supposed to be C code, so let's apply C rules:

1) A byte has a _minimum_ size, defined by its required range, of some 8 bits, but might be 32 or more - and often is. 2) An unsigned has a minimum size, defined by its required range, of 16 bits. 3) A string's length can exceed the range of an int, so the code can produce negative results. A string of length -17? Neat trick. 4) If I recall correctly, identifiers with names of the form str* are reserved.

Thus, we have no idea how many bytes are being examined, but we do know that as far as C (or, for that matter, C++) is concerned, the code is only required to access 16 bits at a time, and it is not required to work at all if the data isn't aligned properly for access as an unsigned.

Rather than worrying about code that's "fast", how about worrying about code that's actually correct, first? If there's no requirement that the code actually functions correctly, I can make a much faster version for you:

int mystrlen(const char *s ) {

return 17;

}

You'd be hard pressed to beat it's performance, and unlike your version, it is at least expected to work.

ace, 7 months ago

To Spanky: Which compiler actually uses "char" of more than 8 bits? Do you know of any? I don't mean "wchar_t", I mean "char".

(On another side, I believe that it would be possible to make a C compiler which would generate code for old machines which used 6 bits for char, had 24-bit words and were bit addressable. So I agree that using "char" and "unsigned" as presented here is not portable. Not to mention endianness. There should probably be some line in the article that mentiones it's all for x86 world.)

Peter Kankowski, 7 months ago

Spanky, thank you for the comments. I added "for x86 only" line. A generalized code for any processor would be too complex (see SSE2 optimised strlen as an example). The compilers might have a different size of unsigned (usually from 16 to 64 bits) and endianness, so you have to use #ifdefs for that.

3), 4) Thanks, I corrected this.

Your name:
Comment: