Abstract:
Unrolling the loop in strlen function.

Created by Peter Kankowski
Last changed
Contributors: Ace, Vladimir Kraljevic, and Madis Kalme
Filed under Low-level code optimization

Share on social sitesReddit Digg Delicious Buzz Facebook Twitter

Fast strlen function

Disclaimer: The strlen function rarely lies on the critical path of the program, and if it is, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.

Vectorized strlen

A usual implementation of strlen function scans the string byte-by-byte looking for terminating zero. For example:

size_t strlen(const char *s) {
    const char *start = s;
    while(*s)
        s++;
    return s - start;
}

It's small and easy, but not very fast. That is how we can improve it:

// for x86 only
size_t my_strlen(const char *s) {
    size_t len = 0;
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero. That is what Agner Fog calls "vector operations in general purpose registers".

Problems and solutions

Warning: this function will crash if an non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of string.

The dwords may be unaligned, but x86 architecture allows access to unaligned data. For small strings, the alignment will take more time than the penalty of unaligned reads.

The code is not portable: you will have to add another 4 conditions if you use a 64-bit processor. For big-endian architectures, the order of conditions should be reversed.

Agner Fog made another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.

Test conditions and results

The functions were tested on several short strings (words in Gettysburg address) and on a long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for measurements.

FunctionShort stringsThe long string
strlen by Microsoft14712104
strlen_my10191471
strlen by Agner Fog12791056

Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.

Download the test program (6 Kb)

Related articles

SSE2 optimised strlen by Dmitry Kostjuchenko

Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski

Peter Kankowski
Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.

Created by Peter Kankowski
Last changed
Contributors: Ace, Vladimir Kraljevic, and Madis Kalme

68 comments

Ten recent comments are shown below. Show all comments

Denis Novikov,

Hi Peter,

I've found that standard strlen works faster than your implementation with Ubuntu Linux 12.4 x86_64 (g++-4.6.3).

I disassembled `libc.so` and saw that strlen function checks the cpu features and calls one of the functions: __strlen_sse2_pminub, __strlen_sse2, __strlen_sse42 or __strlen_sse2_no_bsf.

Denis Novikov,

Sometimes standard `strlen` function works 8x times faster (with the long string).

Peter Kankowski,

Hi Denis,

thank you for noticing. My friend Dmitry wrote a faster SSE2 implementation:

http://www.strchr.com/sse2_optimised_strlen

Denis Novikov,

Peter, I've seen the sse2 implementation by Dmitry.

For the short string your implementation is the fastest (1.4x faster than the glibc's implementation and 1.5x faster than the Dmitry's one).

For the long string your implementation is the slowest (7.5x slower than the glibc's and 4.3x slower than the Dmitry's).

So the fastest implementation is the `standard` glibc's one.

Peter Kankowski,
One year ago, they added a fast implementation by Ondřej Bílka to glibc. I can only congratulate the glibc developers for making it faster than my code. :)
Denis Novikov,

Ubuntu 12.04 is `LTS` and contains 2 yrs. old version of `glibc`. It would be really interesting to test the newest version of `strlen`.

RHyde,

FWIW, here is an implementation that prevents crossing MMU pages using SSE instructions. String addresses are passed in RSI and RDI.

; strcmp2-
;
;  String comparison using pcmpistri instruction 
; and computing string lengths ahead of time.

strcmp2     proc

xmm0Save	textequ	<[rsp]>
xmm1Save	textequ	<[rsp+16]>

            push	rsi
            push    rdi
            push    rcx
            sub	rsp, 32
            movdqu	xmm0Save, xmm0
            movdqu	xmm1Save, xmm1

            ; Make RDI dependent on RSI so we can
            ; step through RDI by incrementing RSI:

            sub	rdi, rsi

            ; Okay, 16-byte align string 1 (pointed
            ; at by RSI):

paraAlign:	mov	al, [rsi]
	cmp	al, [rdi][rsi*1]
	jne	different

	; Move on to next character

	inc	rsi
		
	; Check for end of string:

	cmp	al, 0
	je	same
	; Now we need to see if we've aligned RSI on
	; a 16-byte boundary

	test	rsi, 0fh
	jnz	paraAlign
	
; RSI is now paragraph aligned. We can fetch blocks of
; 16 bytes at [RSI] without fear of a general protection
; fault. However, we don't know what RDI's alignment is,
; so we have to test to see if it's legal to fetch 16 bytes
; at a time from RDI (which is really [rdi][rsi*1] at this
; point).

	sub	  rsi, 16
scLoop:	add	  rsi, 16		;On to next block.
scLoop2:	lea	  rcx, [rdi][rsi*1]	;Check the src2
	and	  rcx, 0fffh		; block to see if
	cmp	  rcx, 0fh		; there are at		
	jbe	  lt16inPage		; least 16 bytes
					; left on MMU page.

; If we have at least 16 bytes left on the MMU page for the
; src2 block, then use pcmpistri to compare the 16 bytes
; at src1 (which we know is completely on an MMU page as
; RSI is 16-byte aligned) against the 16 bytes at src2
; (RDI+RSI). Load src1 bytes into XMM1 and use src2 as
; the pcmpistri operand (because we can use movdqa for
; src1, as it is aligned, and pcmpistri allows non-aligned
; accesses).
	
isAligned:	movdqa	  xmm1, [rsi]
	pcmpistri xmm1, [rsi][rdi*1], scFlags
	ja	  scLoop	;Equal, no zero bytes
	jc	  different2	;Not equal
	
; At this point, the zero flag must be set, so there
; must have been a zero byte in src1 or src2. As the
; characters also match, the strings must be equal.

same:	xor	rax, rax
	jmp	exit
	
; lt16inPage-
;
;  Code transfers to this label when there are fewer
; than 16 characters left in the src2 memory page.
; We must compare byte-by-byte until we hit a zero
; byte or cross the MMU page boundary.
;
; Note that if RCX is zero at this point, then
; RCX is already 16-byte aligned and we can jump
; right back up to the loop above. 

lt16inPage:	jrcxz	isAligned
cmpUpTo16:	mov	al, [rsi]
	cmp	al, [rdi][rsi*1]
	jne	different
	inc	rsi
	cmp	al, 0
	je	same
	dec	rcx
	jnz	cmpUpTo16
	jmp	paraAlign

; Branch to this point from the code where we were
; aligning RSI to a 16-byte boundary and found a
; different character (betwen RSI and RDI).

different2:	add	rsi, rcx	
different:	mov	al, [rsi]
	sub	al, [rdi][rsi*1]
	movsx	rax, al
exit:	movdqa	xmm0, xmm0Save
	movdqa	xmm1, xmm1Save
	add	rsp, 32
	pop     rcx
            pop     rdi
            pop     rsi
            ret
            
strcmp2     endp

Peter Kankowski,
Thank you, it's something I missed in the blog post.
Nicolas ,

1st check if the start of your string is sizeof(unsigned) aligned.

If not check the first few bytes that are not aligned one by one.

Then check the aligned unsigned.

With this, no need to put an extra 3 bytes at the end as you will never hit a page that you can’t read, hence no crash.

And your code will be faster as you will read aligned unsigned values

Peter Kankowski,
Hi Nicolas, thank you so much. Yes, it's a good solution. The code will be a bit longer, but no need to put the extra 3 bytes as you said

Your name:


Comment: