Disclaimer: The strlen function rarely lies on the critical path of the program, and if it is, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.
Vectorized strlen
A usual implementation of strlen function scans the string byte-by-byte looking for terminating zero. For example:
size_t strlen(const char *s) {
const char *start = s;
while(*s)
s++;
return s - start;
}
It's small and easy, but not very fast. That is how we can improve it:
// for x86 only
size_t my_strlen(const char *s) {
size_t len = 0;
for(;;) {
unsigned x = *(unsigned*)s;
if((x & 0xFF) == 0) return len;
if((x & 0xFF00) == 0) return len + 1;
if((x & 0xFF0000) == 0) return len + 2;
if((x & 0xFF000000) == 0) return len + 3;
s += 4, len += 4;
}
}
Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero. That is what Agner Fog calls "vector operations in general purpose registers".
Problems and solutions
Warning: this function will crash if an non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of string.
The dwords may be unaligned, but x86 architecture allows access to unaligned data. For small strings, the alignment will take more time than the penalty of unaligned reads.
The code is not portable: you will have to add another 4 conditions if you use a 64-bit processor. For big-endian architectures, the order of conditions should be reversed.
Agner Fog made another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.
Test conditions and results
The functions were tested on several short strings (words in Gettysburg address) and on a long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for measurements.
Function | Short strings | The long string |
---|---|---|
strlen by Microsoft | 1471 | 2104 |
strlen_my | 1019 | 1471 |
strlen by Agner Fog | 1279 | 1056 |
Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.
Download the test program (6 Kb)
Related articles
SSE2 optimised strlen by Dmitry Kostjuchenko
Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski
68 comments
Ten recent comments are shown below. Show all comments
Hi Peter,
I've found that standard strlen works faster than your implementation with Ubuntu Linux 12.4 x86_64 (g++-4.6.3).
I disassembled `libc.so` and saw that strlen function checks the cpu features and calls one of the functions: __strlen_sse2_pminub, __strlen_sse2, __strlen_sse42 or __strlen_sse2_no_bsf.
Sometimes standard `strlen` function works 8x times faster (with the long string).
Hi Denis,
thank you for noticing. My friend Dmitry wrote a faster SSE2 implementation:
http://www.strchr.com/sse2_optimised_strlen
Peter, I've seen the sse2 implementation by Dmitry.
For the short string your implementation is the fastest (1.4x faster than the glibc's implementation and 1.5x faster than the Dmitry's one).
For the long string your implementation is the slowest (7.5x slower than the glibc's and 4.3x slower than the Dmitry's).
So the fastest implementation is the `standard` glibc's one.
Ubuntu 12.04 is `LTS` and contains 2 yrs. old version of `glibc`. It would be really interesting to test the newest version of `strlen`.
FWIW, here is an implementation that prevents crossing MMU pages using SSE instructions. String addresses are passed in RSI and RDI.
1st check if the start of your string is sizeof(unsigned) aligned.
If not check the first few bytes that are not aligned one by one.
Then check the aligned unsigned.
With this, no need to put an extra 3 bytes at the end as you will never hit a page that you can’t read, hence no crash.
And your code will be faster as you will read aligned unsigned values