Abstract:
Unrolling the loop in the strlen function.

Created by Peter Kankowski
Contributors: Ace, Vladimir Kraljevic, and Madis Kalme
Filed under Low-level code optimization


Fast strlen function

Disclaimer: The strlen function rarely lies on the critical path of a program, and if it does, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.
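(For illustration, a length-prefixed string can be as simple as the following sketch; the type is hypothetical, not taken from any particular library.)

struct pstring {
    size_t len;   /* the "strlen" result, kept up to date by the string functions */
    char  *data;  /* len characters plus a terminating zero */
};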

Vectorized strlen

A typical implementation of the strlen function scans the string byte by byte, looking for the terminating zero. For example:

size_t strlen(const char *s) {
    const char *start = s;
    while(*s)
        s++;
    return s - start;
}

It's small and simple, but not very fast. Here is how we can improve it:

// for x86 only
size_t my_strlen(const char *s) {
    size_t len = 0;
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares each byte with zero. This is what Agner Fog calls "vector operations in general purpose registers".

Problems and solutions

Warning: this function will crash if a non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of the string.
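For example, a duplication helper could reserve the padding itself (a sketch; my_strdup_padded is a hypothetical name):

#include <stdlib.h>
#include <string.h>

/* Sketch: copy a string into a buffer with 3 extra zero bytes,
   so my_strlen can safely read whole dwords past the terminator. */
char *my_strdup_padded(const char *src) {
    size_t n = strlen(src) + 1;       /* including the terminator */
    char *p = (char*)malloc(n + 3);   /* 3 = sizeof(unsigned) - 1 */
    if (p != NULL) {
        memcpy(p, src, n);
        memset(p + n, 0, 3);          /* zero the padding bytes */
    }
    return p;
}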

The dwords may be unaligned, but the x86 architecture allows access to unaligned data. For short strings, aligning the pointer would take more time than the penalty for unaligned reads.

The code is not portable: you will have to add another four conditions on a 64-bit processor, and for big-endian architectures, the order of the conditions should be reversed.
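A 64-bit little-endian variant might look like this (an untested sketch along the same lines; note that it needs 7 bytes of padding after the string instead of 3):

#include <stdint.h>

// for x86-64, little-endian only
size_t my_strlen64(const char *s) {
    size_t len = 0;
    for(;;) {
        uint64_t x = *(const uint64_t*)s;
        if((x & 0xFFULL) == 0) return len;
        if((x & 0xFF00ULL) == 0) return len + 1;
        if((x & 0xFF0000ULL) == 0) return len + 2;
        if((x & 0xFF000000ULL) == 0) return len + 3;
        if((x & 0xFF00000000ULL) == 0) return len + 4;
        if((x & 0xFF0000000000ULL) == 0) return len + 5;
        if((x & 0xFF000000000000ULL) == 0) return len + 6;
        if((x & 0xFF00000000000000ULL) == 0) return len + 7;
        s += 8, len += 8;
    }
}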

Agner Fog wrote another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.
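The heart of such tricks is a branch-free test for a zero byte in a whole word, along these lines (a sketch of the well-known expression quoted in the comments below, not Agner's exact code):

/* Nonzero iff the 32-bit word x contains a zero byte. */
#define HAS_NULLBYTE(x) (((x) - 0x01010101u) & ~(x) & 0x80808080u)

/* The scanning loop then needs only one branch per dword: */
while (!HAS_NULLBYTE(*(const unsigned*)s))
    s += 4;
/* ...and the exact zero byte is located outside the loop. */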

Test conditions and results

The functions were tested on several short strings (the words of the Gettysburg Address) and on one long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for the measurements.

Function               Short strings   The long string
strlen by Microsoft         1471            2104
strlen_my                   1019            1471
strlen by Agner Fog         1279            1056

Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.

Download the test program (6 Kb)

Related articles

SSE2 optimised strlen by Dmitry Kostjuchenko

Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski

Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.


68 comments

AC,
Measurements where inconvenient data are dropped are not good measurements. Statistics is important. In normal execution, you'll have to work with different data and different lengths, so everything counts. What you have to do to get realistic measurements is to simulate realistic conditions. That would typically mean generating a) strings of different lengths and b) enough input data that not everything always fits in the cache.
The easy way to achieve a) in a repeatable fashion (so that I can download your program, run it, for example, on my P4, and have the same input data as you) is to use a random-number generator, seeding it with a known seed before you start generating data. Then you generate a lot of data of some given average length (for example, 50 MB of data) and then measure with (you guessed it) QueryPerformanceCounter. Such tests are repeatable on different processors and can show you surprising differences between processors -- what's very fast on one can be very bad on another (as a small example, code with more function calls in the critical loop is much faster on a P-M than on a P4). Moreover, the tests lie less compared to real use, since you don't always test with so little data that everything happens only inside the cache or inside the processor.
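For example, the generator could be as simple as this sketch (the seed value and the names are illustrative):

#include <stdlib.h>

/* Sketch: repeatable test data -- a fixed seed gives the same
   strings on every machine that runs the benchmark. */
void gen_strings(char *buf, size_t total) {
    size_t i = 0;
    srand(12345);                        /* known seed => repeatable */
    while (i + 101 <= total) {
        size_t len = 1 + rand() % 99;    /* lengths 1..99, average ~50 */
        for (size_t j = 0; j < len; j++)
            buf[i + j] = (char)(1 + rand() % 255);  /* non-zero bytes */
        buf[i + len] = '\0';
        i += len + 1;
    }
}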
Peter Kankowski,
There are two types of profiling: the realistic measurements you are talking about and also synthetic measurements. Both types are useful. Realistic measurements can help if you have a real program, so you can locate its hotspots, optimize them, and see whether the overall performance improves. If the test shows a slowdown, you can try another approach to optimization. And, yes, people use various generators and scripts to simulate the user's data and actions in such tests. Time can be measured with QueryPerformanceCounter or even GetTickCount, because it's typically several seconds or so.

But the whole example with strlen is very unrealistic. I can't imagine a program in which strlen would be a hotspot (i.e., a piece of code where the program spends most of its execution time). Though, you can construct a test to measure the duration of one or several strlen operations. That is what I call synthetic measurements. Agner Fog uses a similar technique to measure the duration of single instructions (such as ADD or MUL) for his instruction tables. You can also extract a short and frequently called function from your program and optimize it separately, using synthetic tests to measure your achievements.

Synthetic tests are usually repeated 10-20 times. The first runs show longer durations because of uncached data access or branch mispredictions. These effects are hard to reproduce; they are also insignificant, because if the function is called 4000 times, it doesn't matter how much time the first 5 runs take, but it does matter how the next 3995 runs do.

Sometimes, a task switch or an interrupt will occur during the test. It looks like this:

 strlen:       2168 ticks <-+ first run effects
 strlen:       2144 ticks   |
 strlen:       2117 ticks   |
 strlen:       2139 ticks <-+
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2823 ticks <-+ a surge probably caused by task switch
 strlen:       2112 ticks <-+
 strlen:       2107 ticks
 strlen:       2107 ticks


This is hard to repeat, too, and you have to discard such outliers.

For synthetic tests, you should use RDTSC, because QueryPerformanceCounter is too imprecise for short time intervals. Agner Fog describes the method in optimizing_assembly.pdf, chapter 18, "Measuring performance".

And about caching: it's hard to imagine that someone will use strlen for very large strings that do not fit in the cache (and if someone does, they should definitely look for a better solution, e.g., storing the length of the string in a variable :). Strlen is typically used for short strings (a path name, a user name, etc.). Such data may be in the L1 cache, in the L2 cache, in RAM, or even in a swap file. It depends on the data usage pattern, and it's hard to reproduce if you don't have a real program where strlen is a hotspot. That's why I used synthetic tests here.

I will repeat this again: both kinds of measurements are useful if used properly. Realistic tests are not as precise as synthetic tests, but they show you the effect that your optimization has on the overall program performance, and it's the most important thing.
AC,
You're right. Agner Fog measures the speed of ADD or MUL with RDTSC and a small number of repetitions because this method is best suited to his goal.
Now, when you program something like a substitute for a library function, you usually have other goals -- like making it worth using instead of the existing library function on the common platforms, and not only on the processor in your own computer. If you are using it as a substitute for the lib function, you are interested in how it performs on average, not on specially tuned cases where branch mispredictions don't happen, because that's impossible in real usage -- if you don't have mispredictions, you probably don't need such a piece of code anyway.
And you're of course 100% right that storing the length with the string can be the right solution for cases where a program does too many strlens.
Still, when you make something, try to run some tests on different processors (like P4 and P-M), measuring real time (in seconds) to do something that looks close enough to the real job -- I expect you are definitely going to have some surprises. And for speed optimizations, what really counts is only how fast you can process a lot of data (and usually on more than one computer). A little can be processed by anything anyway.
Fast strlen « СоНоты,
[...] I stumbled upon what turned out to be yet another reinvention of the wheel: a fast strlen implementation, the strlen_my function from the smallcode blog, although an assembly analogue of this algorithm from 1998 by Paul Hsieh was also found. [...]
icestudent,
Exactly -- a reinvented wheel. The technique described is known in the Intel and AMD manuals as vectorization.

It's also unclear where the "Microsoft's strlen" was taken from, since even a debug build uses the CRT code from strlen.asm, whose contents differ only in the function signature (in VS 7.1 and 8).
Peter Kankowski,
IceStudent said that MS VC++ 2003/2005 has a strlen function that is very similar to mine (see strlen.asm file in CRT sources).

But, I must add, this optimized function will be used only if you set the /Oi- compiler option (Enable Intrinsics = No). By default, if you use /O2 (Maximize Speed) or /Ox (Full Optimization), the /Oi option is turned on, and you have to turn it off manually (add /Oi- to Project Properties > Command Line > Additional Options).

Or, just use my function :).
Chris Severance,
The code is no good because it has a bounds error that will occur when the end of the string falls just before an unallocated page boundary. You must not read beyond the end of the string, and you can't avoid that when reading more than 1 character at a time. It would only work if you could ask your callers to pad their strings to a multiple of sizeof(int) bytes, at which point it would be substantially simpler to code because you would only need to check every sizeof(int) bytes.

Optimizing str() functions is a waste of time. They are used because they are simple, not because they are efficient. If you want efficiency at any size, you take the length once and pass it among many mem() functions. Optimally, you never calculate the length at all, because your string supplier provides it.
Alex Davies,
Chris is right, the above code is dangerous to use. Unless your memory manager uses a few bytes after every allocation, there's no guarantee that you won't get a memory access violation.

The FastCode library for Delphi has highly optimized strlen functions that get around this - even with the 128 bit SSE2 versions. The code for these looks something like:

lea ecx, [eax+16]
test ecx, $ff0
jz @@NearPageEnd
@@WithinPage:
...
@@NearPageEnd:
mov edx, eax
@@Loop:
cmp byte ptr [eax], 0
je @@SetResult2
add eax, 1
test eax, 15
jnz @@Loop
jmp @@WithinPage

This takes advantage of the fact that pages are aligned on 4096-byte boundaries (by my understanding).
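In C, the same page-boundary test could be sketched like this (using uintptr_t from <stdint.h>):

/* Sketch: true when fewer than 16 bytes remain before the next
   4096-byte page boundary, matching the asm test above. */
int near_page_end = (((uintptr_t)s + 16) & 0xFF0) == 0;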
Peter Kankowski,
Alex, thanks for your comment.

The easiest way to prevent this error is to allocate 3 extra bytes for each string and initialize them to zeros (so a string will have 4 terminating zero bytes instead of one). Chris mentioned this, too: 4 == sizeof(int).

In this case, strcmp, strcpy, and other functions can also operate on DWORDs (or ints), so they will be fast and simple. No special processing for the remaining 1-3 bytes at the end of the string will be needed.
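For example, with both buffers padded that way, a dword-at-a-time copy needs no tail handling (an untested sketch using the usual zero-byte test):

/* Sketch: strcpy over dwords; assumes both strings are padded to a
   multiple of 4 bytes with zeros, as described above. */
void my_strcpy4(char *dst, const char *src) {
    for (;;) {
        unsigned x = *(const unsigned*)src;
        *(unsigned*)dst = x;             /* may copy up to 3 pad bytes */
        if ((x - 0x01010101u) & ~x & 0x80808080u)
            return;                      /* this dword held the zero byte */
        src += 4; dst += 4;
    }
}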

IMHO, it's stupid to use SSE2 to optimize strlen. If your string is so long, you should store its length somewhere and not use the loop over the string to determine its length. Strlen is basically an O(N) loop, no matter how much you optimize it.
ac,
A. Fog's "clever math tricks" for the first time appeared on the interent as the C code in the post Alan Mycroft wrote in Usent comp.arch on Apr 8, 1987 (yes some 20 years ago, as RISC architecture started to be used):

"You might be interested to know that such detection of null bytes in words can be done in 3 or 4 instructions on almost any hardware (nay even in C). (Code that follows relies on x being a 32 bit unsigned (or 2's complement int with overflow ignored)...)

   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080) 


Then if e is an expression without side effects (e.g. variable)

   has_nullbyte_(e) 


is nonzero iff the value of e has a null byte. (One can view this as explicit programming of the Manchester carry chain present in many processors which is hidden from the instruction level).
On Acorn's ARM risc machine using this feature as Brian Case suggests speeded up dhrystone by 20-25%"
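Dropped into the strlen loop from the article, the macro would be used something like this (my sketch, not Mycroft's code; x86, 32-bit):

size_t strlen_mycroft(const char *s) {
    const char *p = s;
    unsigned x;
    while (x = *(const unsigned*)p, !has_nullbyte_(x))
        p += 4;
    /* locate the exact zero byte among the four */
    if ((x & 0xFF) == 0)       return p - s;
    if ((x & 0xFF00) == 0)     return p - s + 1;
    if ((x & 0xFF0000) == 0)   return p - s + 2;
    return p - s + 3;
}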
Vladimir Kraljevic,
Hi,

I don't agree with all the claims posted.
First, your method of measuring the efficiency of the functions is affected by a few unaccounted-for factors:

1. When measuring something, we'll get more accurate results if the tests are run for a longer time
2. Context switches are unlikely to occur frequently under Windows when you are running in a realtime-priority thread
3. Your code assumes that it is executing on a single-core machine - you didn't set the affinity mask, so you could even read another core's time stamp counter (currently no CPU in existence has an invariant TSC)
4. The main optimization of your code, as you'll see, comes from proper data alignment and removing the unnecessary "len" variable
5. Another obstacle to measurement is the printf function in the middle of the critical loop, which can yield weird timings

So now, your function looks like this:


__declspec(align(16)) int __stdcall strlen_yours(const char *s)
{
    __declspec(align(16)) const char *r = s;
    for(;;)
    {
        __declspec(align(16)) unsigned x = *((unsigned *)s);
        if((x & 0x000000FF) == 0) return (int)(s - r);
        if((x & 0x0000FF00) == 0) return (int)((s - r) + 1);
        if((x & 0x00FF0000) == 0) return (int)((s - r) + 2);
        if((x & 0xFF000000) == 0) return (int)((s - r) + 3);
        s += 4;
    }
}


And this "traditional" assembly function like this:


__declspec(align(16)) int __stdcall strlen_traditional(const char* s)
{
    __declspec(align(16)) int result = 0;
    __asm
    {
        xor   eax, eax
        mov   edi, [s]
        mov   ecx, eax
        dec   ecx
        repne scasb
        not   ecx
        dec   ecx
        mov   eax, ecx
        mov   [result], eax
    }
    return result;
}


I tested these with the Microsoft C++ compiler and Intel C++ 10.0.027, and the results are... how to put it... weird?
Believe me, I'm also surprised by the results... It seems that Intel knows a little bit more about its own architecture.
Five times faster code execution, even on the AMD CPU, is a little bit too much, but I'm pretty sure that these results are correct.

I measured the number of iterations performed during a specified period of time, per function.
If you agree, this way you get more accurate results - how many times you can perform the same function in a limited time interval.
Contrary to some assumptions, the repXX prefixes are still in the game, and are obviously the fastest in the West :)
Perhaps on some other architecture this will not be the case, but let's collect some results first, so we can compare.

The following tests were performed on AMD x64 X2 5600+ CPU, with 1024KB of L2 cache per core.

You can download modified source code and the binaries here:
http://www.borghesia.com/code/peter.zip

Command line output:

C:\Work\Kanowski\Release>Kanowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 25038911 iterations over 12000 ms
Finished Peter's original - with 53662840 iterations over 12000 ms
Finished Peter's optimized - with 82889344 iterations over 12000 ms
Finished Agner Fog's assembly - with 110818331 iterations over 12000 ms
Finished Traditional assembly - with 133129466 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 16659936 iterations over 12000 ms
Finished Peter's original - with 47309270 iterations over 12000 ms
Finished Peter's optimized - with 77931448 iterations over 12000 ms
Finished Agner Fog's assembly - with 108467888 iterations over 12000 ms
Finished Traditional assembly - with 124982814 iterations over 12000 ms

C:\Work\Kanowski\Release>Kanowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 217500254 iterations over 12000 ms
Finished Peter's original - with 441657769 iterations over 12000 ms
Finished Peter's optimized - with 668833221 iterations over 12000 ms
Finished Agner Fog's assembly - with 701593142 iterations over 12000 ms
Finished Traditional assembly - with 724449336 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 228646631 iterations over 12000 ms
Finished Peter's original - with 454222319 iterations over 12000 ms
Finished Peter's optimized - with 679860555 iterations over 12000 ms
Finished Agner Fog's assembly - with 710564707 iterations over 12000 ms
Finished Traditional assembly - with 727093980 iterations over 12000 ms

Cheers,
Vladimir
Madis Kalme,
I took this one and ran it on my Core 2 Duo T7200 @ 2.00GHz with a 4MB shared L2 cache. The result differences aren't that impressive, because what the Core architecture did was decrease latencies and increase the performance of slow instructions, so bad code also performs better :)
I have yet to test it as a native 64-bit application (for now it ran in WOW64 mode).

D:\Documents and Settings\Administrator\Desktop\tests>Kanowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 14581667 iterations over 12000 ms
Finished Peter's original - with 34420379 iterations over 12000 ms
Finished Peter's optimized - with 53623397 iterations over 12000 ms
Finished Agner Fog's assembly - with 70715876 iterations over 12000 ms
Finished Traditional assembly - with 77313332 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 16095626 iterations over 12000 ms
Finished Peter's original - with 32171382 iterations over 12000 ms
Finished Peter's optimized - with 48249741 iterations over 12000 ms
Finished Agner Fog's assembly - with 69516110 iterations over 12000 ms
Finished Traditional assembly - with 73369120 iterations over 12000 ms

D:\Documents and Settings\Administrator\Desktop\tests>Kanowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 46771442 iterations over 12000 ms
Finished Peter's original - with 93445100 iterations over 12000 ms
Finished Peter's optimized - with 140225969 iterations over 12000 ms
Finished Agner Fog's assembly - with 161522516 iterations over 12000 ms
Finished Traditional assembly - with 168330988 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 46733163 iterations over 12000 ms
Finished Peter's original - with 93433140 iterations over 12000 ms
Finished Peter's optimized - with 140150691 iterations over 12000 ms
Finished Agner Fog's assembly - with 161667933 iterations over 12000 ms
Finished Traditional assembly - with 165528894 iterations over 12000 ms

Regards,
Madis

Update: The benchmarking code contained a bug, so these results are invalid. See the discussion below.
ac,
Sorry, Vladimir, but your program is so broken that it's not worth explaining, and your claims are not based on anything relevant. You have a lot to learn.
Vladimir Kraljevic,
Well, the only thing that is "broken" is that I had to convert it so that the Intel C++ compiler would swallow it (the GetRDTSC() function was initially __fastcall, which is why I have one mistake in the strlen_traditional() function), so the final version should have one instruction less:


__declspec(align(16)) int __stdcall strlen_traditional(const char* s)
{
    __declspec(align(16)) int result = 0;
    __asm
    {
        xor   eax, eax
        mov   edi, [s]
        mov   ecx, eax
        dec   ecx
        repne scasb
        not   ecx
        dec   ecx
        mov   [result], ecx
    }
    return result;
}


And second, I must apologise to Peter: his family name is Kankowski, not Kanowski. I'm sorry, Peter!

AC: And regarding your comment, you actually said nothing at all.

Besides, you don't have a single argument to back up a single word you said. Show us anything you wrote / optimized (and please explain, if the code is broken, how Peter's code as optimized by me runs faster - my proof is available). Secondly, strlen_traditional is named after the original MS-DOS code which used this construction, not after my function :)

Unless you show us your talents, please stay away from the discussion - or perhaps I should; maybe this is not the right place to share opinions. Peter, please moderate.

ac probably stands for (A)nonymous (C)oward?
Madis Kalme,
Okay, the results are in. As you can see, 64-bit gives strlen only a mild boost (1% - 2%), and there should be another version of Agner's algorithm under 64-bit with wider tests. I also removed the unnecessary move just before putting ecx back into memory (in the traditional asm variant).

strlen - short strings (x64):
Finished Default              - with   47180255 iterations over 12000 ms
Finished Peter's original     - with   94500072 iterations over 12000 ms
Finished Peter's optimized    - with  141820205 iterations over 12000 ms
Finished Agner Fog's assembly - with  161002095 iterations over 12000 ms
Finished Traditional assembly - with  167928843 iterations over 12000 ms

strlen - long strings (x64):
Finished Default              - with   47324770 iterations over 12000 ms
Finished Peter's original     - with   94649438 iterations over 12000 ms
Finished Peter's optimized    - with  141971792 iterations over 12000 ms
Finished Agner Fog's assembly - with  163672178 iterations over 12000 ms
Finished Traditional assembly - with  167573645 iterations over 12000 ms


Madis

Update: The benchmarking code contained a bug, so these results are invalid. See the discussion below.
Madis Kalme,
Final update (replaced with 64-bit assembly):

strlen - short strings (x64):
...
Finished Agner Fog's assembly - with 158704981 iterations over 12000 ms
...
strlen - long strings (x64):
...
Finished Agner Fog's assembly - with 169442729 iterations over 12000 ms
...
ace,
Vladimir inserted his "results" into his messages, but not how he got them; otherwise it would be obvious how wrong he is. This is the code with which he "measures" (taken from his link, without the "debug" parts):

dwEnd = ::GetTickCount() + SAMPLING_INTERVAL;
do
{
    ALIGN UINT64 start = GetRDTSC();
    testfunc(strlenfunc);
    ALIGN UINT64 time = GetRDTSC() - start;
    dwCounter += (UINT32)time;
    dwNow = ::GetTickCount();
    g_dwTotalIterations++;
}
while(dwNow < dwEnd);
printf("Finished %10u iters\n", g_dwTotalIterations );


Obviously, he doesn't measure "strlenfunc". Since he doesn't know what he measures, I'll leave it to other readers to explain. Please don't communicate with Vladimir directly until he gets humbler.
ace,
I took a second look at Vladimir's code: any method he'd test at the end would be the best. This is my last comment on Vladimir's "code" or his claims.
Madis Kalme,
Actually, you can hardly ever get the raw results, and even if you do, they won't matter, because you will almost always have to add the overhead of calling the function itself.

Because all the measurements were done in an identical environment, they can be compared to each other. Of course, you can't relate them to any results obtained from another OS/CPU/measuring-algorithm configuration.

Madis
ace,
Madis, did you compile his code? Why don't you reorder the lines in which he invokes tests?

Instead:
	TestStrLen(name_default, 
	TestStrLen(name_peter, 
	TestStrLen(name_peter_new, 
	TestStrLen(name_agner_fog,
	TestStrLen(name_trad, 


Try:

	TestStrLen(name_trad, 
	TestStrLen(name_default, 
	TestStrLen(name_agner_fog,
	TestStrLen(name_peter_new, 
	TestStrLen(name_peter, 


And inform us what you can conclude.
Madis Kalme,
:| How could I be that stupid - there's no delta - these are all absolute ticks.

(Bang my head x 3) me so blind :P

Anyway, on 64-bit, traditional ASM seems to be the slowest, and default~peter~optimized_peter are on the same level. I'm still wondering why Agner's is so much slower than any of these three :S

Sorry again,
Madis
Vladimir Kraljevic,
Now, that's the argument! :)

Sorry "ac" or "ace", you're right regarding the measurement, my mistake.

It would be so much easier if you swallowed your pride and next time simply showed right away what's wrong; that way we could make fast code faster.

But the results are still there: with the Intel C++ optimizing compiler, Peter's function is the winner, leaving Agner Fog's and the traditional implementation far behind (though we cannot compare i386 with native x64 code, so we have to wait for Agner's update), and my optimization makes it even more successful :)

Updated results (and code is on my site as well http://www.borghesia.com/code/peter.zip) looks like:


C:\Work\Kankowski\Release>Kankowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1

Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926

strlen - short strings (x86):
Finished Traditional assembly - with 22375081 iterations over 12000 ms
Finished Agner Fog's assembly - with 29198690 iterations over 12000 ms
Finished Peter's optimized - with 28557076 iterations over 12000 ms
Finished Peter's original - with 28534699 iterations over 12000 ms
Finished Default - with 24722537 iterations over 12000 ms

strlen - long strings (x86):
Finished Traditional assembly - with 16307313 iterations over 12000 ms
Finished Agner Fog's assembly - with 30159599 iterations over 12000 ms
Finished Peter's optimized - with 30248880 iterations over 12000 ms
Finished Peter's original - with 30273804 iterations over 12000 ms
Finished Default - with 16467480 iterations over 12000 ms

C:\Work\Kankowski\Release>Kankowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1

Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926

strlen - short strings (x86):
Finished Traditional assembly - with 22396916 iterations over 12000 ms
Finished Agner Fog's assembly - with 32896169 iterations over 12000 ms
Finished Peter's optimized - with 225898470 iterations over 12000 ms
Finished Peter's original - with 221394773 iterations over 12000 ms
Finished Default - with 221423927 iterations over 12000 ms

strlen - long strings (x86):
Finished Traditional assembly - with 16335173 iterations over 12000 ms
Finished Agner Fog's assembly - with 37920384 iterations over 12000 ms
Finished Peter's optimized - with 224262402 iterations over 12000 ms
Finished Peter's original - with 221292361 iterations over 12000 ms
Finished Default - with 222935383 iterations over 12000 ms
Vladimir Kraljevic,
It seems that the Intel C++ optimizing compiler does better dead-code elimination.

Regarding the suspiciously good Intel results, I've found out that MSVC++ doesn't properly handle dead code even when it is told to... Grmph #@%$!!!

These are the final results - Agner's function is much faster than any other, even with "optimizing compilers". Peter's function is much faster when completely eliminated :(

C:\Work\Kanowski\Release>Kankowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1

Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926

strlen - short strings (x86):
Finished Traditional assembly - with 22266431 iterations over 12000 ms
Finished Agner Fog's assembly - with 32086429 iterations over 12000 ms
Finished Peter's optimized - with 26472001 iterations over 12000 ms
Finished Peter's original - with 22154576 iterations over 12000 ms
Finished Default - with 27178874 iterations over 12000 ms

strlen - long strings (x86):
Finished Traditional assembly - with 16251249 iterations over 12000 ms
Finished Agner Fog's assembly - with 38119397 iterations over 12000 ms
Finished Peter's optimized - with 24963566 iterations over 12000 ms
Finished Peter's original - with 21281613 iterations over 12000 ms
Finished Default - with 18453774 iterations over 12000 ms
Vladimir Kraljevic,
The only conclusions I can draw from this whole thing are:

1. Why buy and use "optimizing compilers" when manually written assembly code for the simplest functions is still a few times faster?
What about slightly more complex functions?

2. Proper alignment is obviously crucial nowadays.

3. Contrary to some manuals, Agner Fog's function performs better. Watch out.

4. Peter's code should run fastest according to today's CPU manufacturers' "best practices" (search for them on developer.intel.com and developer.amd.com), disregarding the fact that it needs trailing bytes to prevent possible access violations. But it does not.

5. The bad news is that, by my best guess, instruction pairing (as you can see from Agner's function) has more impact on our code than we ever imagined. That leads us to a lose-lose situation where both hand-written assembly functions and compiled ones behave a few times slower if they don't obey these rules. On the other hand, these rules differ slightly by CPU manufacturer and CPU type, which leads us into an even deeper abyss.
ace,
Just a note for anybody else reading this topic: I've checked Vladimir's latest code, and he still doesn't have a clue what he actually measures. Any numbers he publishes are still not useful. It seems that Madis doesn't understand the critical issues either. But I don't intend to teach V. something he missed if he slept through his classes in school, especially after he has trolled this topic so much instead of trying to think and learn something. I'm not going to comment here anymore until he publishes really correct code. The current one is still wrong for measuring what he believes he measures.
Peter Kankowski,
Vladimir, thank you for your contribution. Your optimized function is faster on my tests, too.

As AC noted, your benchmarking code has some problems:

> When measuring something we'll get more accurate results if tests are run for longer time
That's not always true. You can make very precise measurements with RDTSC, so you do not need to repeat the tests for short and fast functions (see Agner Fog's manual - optimizing_assembly.pdf, chapter "Measuring performance").

> __declspec(align(16))
You do not need to align every function; the compiler does it automatically (look at your executable under a disassembler). The excessive alignment of variables does not provide any benefit (test it with and without alignment and see the difference).

> Your code assumes that it is executing on single core machine
Yes, your code is better in this respect.

rdtsc
mov [result_hi], edx
mov [result_lo], eax


By the x86 calling conventions, 64-bit values are returned in EDX:EAX, so you do not need to copy them to memory and back. Again: look at your code under a disassembler or use the listing files generated by the compiler.

BTW, MSVC++ has an intrinsic function, __rdtsc, which you can use (look for "__rdtsc" in MSDN).
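For example (a minimal sketch):

#include <intrin.h>   /* MSVC: declares __rdtsc */

unsigned __int64 start = __rdtsc();
/* ... code being measured ... */
unsigned __int64 ticks = __rdtsc() - start;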

What is the purpose of the dwCounter variable and the GetRDTSC() measurements in your code?

Finally, you call GetTickCount() on each iteration of the inner loop. Win32 API calls are "heavy"; such a call may take more time than the strlen() function itself. A better way to measure throughput:

// pseudocode, not tested
UINT REPEAT = 1000;
UINT64 start = GetRDTSC();
for (UINT i = 0; i < REPEAT; i++)
    tested_func();
UINT64 time = GetRDTSC() - start;
UINT64 throughput = ((UINT64)REPEAT * TSC_frequency) / time;


Well, Vladimir, your benchmarks are not perfect (neither are mine), but you have a lot of interesting ideas. Feel free to report them.


Agner Fog's code is faster than your optimized version on Core 2, but slower on Pentium M (in both your benchmarks and mine). It may be due to the longer pipeline in Core 2 or to some other microarchitectural detail. It would be interesting to see the results on a Pentium 4. Does anybody here have a P4?

Madis, thank you very much for results on Core 2 and 64-bit mode.
ace-a-bad-cop,
I can run something on P4 but not before Sunday (I'm travelling now) and only if I like the code. ;>
I still don't know which of Vladimir's ideas are relevant except for "you can use end - begin instead of maintaining a length". Almost everything else he got plain dead wrong.
Madis Kalme,
I would prefer if the administrator edited my first 3 misleading posts. I hope there could be some message persuading the reader to continue to the next messages. That is all.

Thank You!
Madis
Peter Kankowski,
Madis, I inserted a notice about invalid results.
smallcode &raquo; Blog Archive &raquo; Using sentinel in string manipulation,
Another method to speed up string manipulation
Brian Hurt,

You missed an obvious optimization. You can detect a null in a four-byte word like this:

    unsigned y = x & 0x7F7F7F7Fu;
    y += 0x7F7F7F7F;
    y |= x;
    if ((y & 0x80808080) != 0x80808080u) {
        /* x has a nul- check each byte */
    } else {
        /* x does not have a null */
    }

The advantage of this approach is fewer jumps; it uses only simple (1-clock) instructions and 2 registers, allowing you to unroll this loop 2 or 3 times and give the execution units plenty to do.
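An unrolled scan using this test might look like the following sketch (x86, 32-bit; it assumes the buffer is padded so that reading two dwords past the terminator is safe):

for (;;) {
    unsigned x0 = ((const unsigned*)s)[0];
    unsigned x1 = ((const unsigned*)s)[1];
    unsigned y0 = ((x0 & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | x0;
    unsigned y1 = ((x1 & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | x1;
    if ((y0 & 0x80808080u) != 0x80808080u) break;            /* nul in x0 */
    if ((y1 & 0x80808080u) != 0x80808080u) { s += 4; break; } /* nul in x1 */
    s += 8;
}
/* the dword at s contains the nul - check each of its bytes */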

Paolo Bonzini,

Regarding invalid accesses, one possibility is this (sorry for the goto, so much easier to write it this way):

len = 0;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
loop:
    for(;;) {
        unsigned x = *(unsigned*)p;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        p += 4, len += 4;
    }

By aligning the address at the beginning, you ensure the code will never perform memory accesses that span two pages.

Peter Kankowski,

Thank you, it's a nice idea! I would rewrite it with a fall-through switch (similar to Duff's device) to avoid the excess conditions:

int strlen_my(const char *s) {
    const char *p = s;
    /* handle the 1-3 unaligned leading bytes; (intptr_t)p & 3 == 1
       means three bytes remain until alignment, and so on */
    switch ( (intptr_t)p & 3 ) {
        case 1:
            if (*p == 0)
               return p - s;
            p++;
        case 2:
            if (*p == 0)
               return p - s;
            p++;
        case 3:
            if (*p == 0)
               return p - s;
            p++;
    }
 
    for(;;) {
        unsigned x = *(unsigned*)p;
        if((x & 0xFF) == 0) return p - s;
        if((x & 0xFF00) == 0) return p - s + 1;
        if((x & 0xFF0000) == 0) return p - s + 2;
        if((x & 0xFF000000) == 0) return p - s + 3;
        p += 4;
    }
}

The benchmark shows that this code is slower on short strings, but slightly faster on a long string because of aligned memory access. Thank you again for the idea.

Paolo Bonzini,

Can you find where the break-even point with Agner's code is now? This is probably as good as you can get for reasonably sized strings; and it's safer than your original code, whose unsafety made the comparison apples-to-oranges, basically.

(Note that you are trading up to 3 conditions for always 2 conditions – or no change at all if the compiler decides it's not worth using a decision tree implementation of the switch statement – so not much is changing regarding the number of jumps).

k06a,
s += 4; => ((unsigned*)s)++;
// Increment is more useful, I think.
// Increment is more usefull i think.
Dmitry Kostjuchenko,

I would propose aligning in a while loop to avoid the excessive conditions; this should outperform the fall-through switch as well, by +1.5% at least in my synthetic tests.

size_t __aux_strlen(const char *str)
{
    register size_t len = 0;
    // align pointer to 32 bits
    while ((((intptr_t)str) & (sizeof(unsigned int)-1)) != 0)
    {
        if (*str ++ == 0) 
            return len; 
        ++ len;
    }
    // search for 0
    for (;;)
    {
        unsigned int v = *(unsigned int *)str;
        if ((v & 0x000000FFU) == 0) return (len);
        if ((v & 0x0000FF00U) == 0) return (len + 1);
        if ((v & 0x00FF0000U) == 0) return (len + 2);
        if ((v & 0xFF000000U) == 0) return (len + 3);
        str += sizeof(unsigned int);
        len += sizeof(unsigned int);
    }
}
Peter Kankowski,

Thank you very much, Dmitry. Your function is also 29 bytes shorter when compiled with MSVC++ 2005.

Spanky,

"Four bytes are examined at once." Really? How odd. This _looks_ like it's supposed to be C code, so let's apply C rules:

1) A byte has a _minimum_ size, defined by its required range, of some 8 bits, but might be 32 or more - and often is. 2) An unsigned has a minimum size, defined by its required range, of 16 bits. 3) A string's length can exceed the range of an int, so the code can produce negative results. A string of length -17? Neat trick. 4) If I recall correctly, identifiers with names of the form str* are reserved.

Thus, we have no idea how many bytes are being examined, but we do know that as far as C (or, for that matter, C++) is concerned, the code is only required to access 16 bits at a time, and it is not required to work at all if the data isn't aligned properly for access as an unsigned.

Rather than worrying about code that's "fast", how about worrying about code that's actually correct, first? If there's no requirement that the code actually functions correctly, I can make a much faster version for you:

int mystrlen(const char *s ) {

return 17;

}

You'd be hard pressed to beat it's performance, and unlike your version, it is at least expected to work.

ace,

To Spanky: Which compiler actually uses "char" of more than 8 bits? Do you know of any? I don't mean "wchar_t", I mean "char".

(On another note, I believe it would be possible to make a C compiler that would generate code for old machines which used 6 bits for char, had 24-bit words, and were bit-addressable. So I agree that using "char" and "unsigned" as presented here is not portable. Not to mention endianness. There should probably be a line in the article mentioning it's all for the x86 world.)

Peter Kankowski,

Spanky, thank you for the comments. I added the "for x86 only" line. A generalized version for any processor would be too complex (see SSE2 optimised strlen as an example). Compilers might have a different size of unsigned (usually from 16 to 64 bits) and a different endianness, so you would have to use #ifdefs for that.

3), 4) Thanks, I corrected this.

Konstantin Belochapka,

Another implementation of strlen() that uses the same approach; it outperforms the run-time function by 120% on larger strings.

#define  SIZEOF_SIZE_T     (sizeof(size_t))
#define  SIZEOF_SIZE_T_BITS     (8*(SIZEOF_SIZE_T))

size_t __fast_strlen(const char *str)
{
   register size_t len = 0;

   // align pointer to natural size_t alignment
   while ((((size_t)str) & (SIZEOF_SIZE_T - 1)) != 0) {
      if (*str++ == 0) {
         return len;
      }
      ++len;
   }

   size_t n = (size_t)0x7F7F7F7F7F7F7F7FLL; 

   // search for 0 - little endian
   for(;;) {
      size_t v = *(size_t *)str;
      if (v == 0) return (len);

      size_t y = (v & n) + n;     // convert each null byte into 0x80 and each non null byte into 0x00
      y = ~(y | v | n);
      if( y == 0) {   // no null bytes
         str += SIZEOF_SIZE_T;
         len += SIZEOF_SIZE_T;
         continue;
      }

      size_t  yy0  = y & 0xFFFFFFFFLL;
      if( yy0 == 0) {  yy0 = y >> (SIZEOF_SIZE_T_BITS/2); len += SIZEOF_SIZE_T/2; }

      size_t  yy1  = yy0 & 0xFFFFLL;
      if( yy1 == 0) {  yy1 = yy0 >> (SIZEOF_SIZE_T_BITS/4); len += SIZEOF_SIZE_T/4; }

      size_t  yy2  = yy1 & 0xFFLL;
      if( yy2 == 0) {  yy2 = yy1 >> (SIZEOF_SIZE_T_BITS/8); len += SIZEOF_SIZE_T/8; }

      return len;
   }

   return len;  // never gets here, make the compiler happy
}

If anyone needs it, I can paste a template that works with wchar_t and on big-endian machines.

Thanks, Peter, for your idea.

  
Konstantin Belochapka,

Forgot to mention: this is for a 64-bit compiler (sizeof(size_t) == 8).

Konstantin Belochapka,

Of course, nothing beats the SSE 4.2 PCMPxSTRx instructions. They finally introduced an instruction that can significantly speed up all the memxxx() and strxxx() functions.

Konstantin Belochapka,

For those who do not like resorting to assembler, here is a strlen() that uses a Microsoft C compiler intrinsic.

It outperforms the run-time library strlen() by 500 percent.

#define EQUAL_ANY	0x00    // 0000b
#define RANGES		0x04    // 0100b
#define EQUAL_EACH	0x08    // 1000b
#define EQUAL_ORDERED	(0x8+0x04)    // 1100b
#define NEGATIVE_POLARITY  0x10     // 010000b
#define BYTE_MASK	   0x40     // 1000000b


#include <nmmintrin.h>  // SSE4.2 intrinsics

size_t  sse4_strlen(const char *str)
{
   size_t len = 0;
   __m128i  z, c;

   z = _mm_setzero_si128();

   int idx;
   while(true) {

      c = _mm_loadu_si128( (const __m128i*)(str) ); 
      idx = _mm_cmpistri(z, c, EQUAL_EACH);   //returns index of the first byte that matches
      if(idx < sizeof(__m128i)) break;
      len += sizeof(__m128i);
      str += sizeof(__m128i); 
   }
   return len + idx;
}



 
Konstantin Belochapka,

Obviously, all the other functions of the strxxx() family can be implemented just as easily.

For example strchr()

#define EFLAG_CARRY     (0x0001LL<<0)
#define EFLAG_PARITY    (0x0001LL<<2)
#define EFLAG_ADJUST    (0x0001LL<<4)
#define EFLAG_ZERO      (0x0001LL<<6)
#define EFLAG_SIGN      (0x0001LL<<7)
#define EFLAG_OVERFLOW  (0x0001LL<<11)

#include <intrin.h>     // __readeflags (MSVC)
#include <nmmintrin.h>  // SSE4.2 intrinsics

size_t  sse4_strchr(const char *str, char ch)
{
   size_t len = 0;
   __m128i  z, c;
   __int64  eflags;

   z = _mm_set1_epi8(ch);  // convert character to a string of characters

   int idx;
   while(true) {

      c = _mm_loadu_si128( (const __m128i*)(str) ); 
      idx = _mm_cmpistri(z, c, EQUAL_EACH);   // returns index of a first matched byte or 16 otherwise
      eflags = __readeflags();
      if(idx < sizeof(__m128i)) break;
      if(eflags & EFLAG_ZERO) {   // Zero Flag is set if operand contains zero byte
         return (size_t(-1));
      }
      len += sizeof(__m128i);
      str += sizeof(__m128i); 
   }
   return len + idx;
}

Also, the performance increase in comparison with the run-time library function is very significant.

Peter Kankowski,
Konstantin, thank you very much for the code.
mischa sandberg,

You don't need SSE4 to get this speed-up; pcmpeq + movemask + bsf do the job quite well.

Here's an sse2 equivalent, with attention to using aligned loads. It's gcc 4.4.

Pardon my dense style and presumed #includes

#define xmload(p)   _mm_load_si128((__m128i const*)p)
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))

char const *sse2_strchr(char const *str, char ch)
{
    __m128i     x, zero = {}, p0 = _mm_set1_epi8(ch);
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of tgt:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = (xmatch(x,p0) >> f) & ~z & (z - 1)
        if (m) return tgt + ffs(m) - 1;
        if (z) return NULL;
        tgt += 16 - f;
    }

    // 16-byte aligned chunks of str:
    for (; !(z = xmatch(x = xmload(str), zero)); str += 16)
        if ((m = xmatch(x,p0)))
            return str + ffs(m) - 1;

    // Final 0..15 bytes of tgt:
    if ((m = xmatch(x,p0) & ~z & (z - 1)))
        return str + ffs(m) - 1;

    return NULL;
}

Peter Kankowski,

Thank you. Agner Fog uses the same approach for strlen in his asmlib. Unfortunately, even after small corrections your code fails the tests.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <emmintrin.h>

#define xmload(p)   _mm_load_si128((__m128i const*)p)
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
#ifdef __GNUC__
#define ctz(a) __builtin_ctz(a)
#elif defined(_MSC_VER)
#include <intrin.h>
inline unsigned long ctz(unsigned long x) {
	unsigned long index;
	_BitScanForward(&index, x);
	return index;
}
#endif

// by mischa sandberg
char const *sse2_strchr(char const *str, char ch)
{
    __m128i     x, zero = {}, p0 = _mm_set1_epi8(ch);
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of tgt:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = (xmatch(x,p0) >> f) & ~z & (z - 1);
        if (m) return str + ctz(m);
        if (z) return NULL;
        str += 16 - f;
    }

    // 16-byte aligned chunks of str:
    for (; !(z = xmatch(x = xmload(str), zero)); str += 16)
        if ((m = xmatch(x,p0)))
            return str + ctz(m);

    // Final 0..15 bytes of tgt:
    if ((m = xmatch(x,p0) & ~z & (z - 1)))
        return str + ctz(m);

    return NULL;
}


int main() {
    assert( strcmp( sse2_strchr("abc", 'a'), "abc" ) ==0 );
    assert( NULL == sse2_strchr("", 'a') );
    assert( strcmp( sse2_strchr("abc", 'b'), "bc" ) ==0 );
    assert( NULL == sse2_strchr("abc", 'z') );
    
    assert( "st" == sse2_strchr("abc def ghi jkl mno pqr st", 's') );
    assert( "mno pqr st" == sse2_strchr("abc def ghi jkl mno pqr st", 'm') );
    return 0;
}
(crashes on the first xmload with access violation; tested under MSVC++ and GCC)
mischa sandberg,

Gah! Sorry. To save space I compressed xmload/xmatch from static inline functions into #define's.

The fix is:

#define xmload(p)   _mm_load_si128((__m128i const*)(p))

mischa sandberg,

Also, I'm fascinated that

   assert("st" == sse2_strchr(...)) 

works, as opposed to

   assert(!strcmp("st", sse2_strchr(...)))

With the compile switches I use, that doesn't get past a compile error:

"comparison with string literal results in unspecified behavior"

mischa sandberg,

The above logic can be extended to a strstr that scans for a match on the first two chars of the pattern.

Sorry for not working out the MSVC details here.

$ cat toad.c
#include <string.h>
#include <stdint.h>
#include <emmintrin.h>
static inline int under(int x) { return ~x & (x - 1); }
#define ctz(m)      (ffs(m) - 1)
#define xmload(p)   _mm_load_si128((__m128i const*)(p))
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
static char const *scanchr2(char const *str, char const pat[2])
{
    __m128i     x, zero = {};
    __m128i     p0 = _mm_set1_epi8(pat[0]), p1 = _mm_set1_epi8(pat[1]);
    uint16_t    pair = *(uint16_t const*)pat;
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of str:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = under(z) & ((xmatch(x,p0) & (xmatch(x,p1) >> 1)) >> f);
        if (m) return str + ctz(m);
        if (z) return NULL;
        str += 16 - f;
        if (*(uint16_t const*)(str - 1) == pair)
            return str - 1;
    }

    // 16-byte aligned chunks of str:
    while (!(z = xmatch(x = xmload(str), zero))) {
        m = xmatch(x,p0) & (xmatch(x,p1) >> 1);
        if (m) return str + ctz(m);
        str += 16;
        if (*(uint16_t const*)(str - 1) == pair)
            return str - 1;
    }

    // Final 0..15 bytes of str:
    m = under(z) & xmatch(x,p0) & (xmatch(x,p1) >> 1);
    return m ? str + ctz(m) : NULL;
}

static char const *sse2_strstr(char const *str, char const *pat)
{
    unsigned patlen = strlen(pat);
    if (patlen == 0) return str;
    if (patlen == 1) return strchr(str, *pat);
    while ((str = scanchr2(str, pat)))
        if (!memcmp(str+2, pat+2, patlen-2))
            return str;
    return NULL;
}

int main(void)
{
    return strcmp("world", sse2_strstr("hello, world", "world"));
}

$ gmake toad
cc -g -MMD -O9 -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Winline -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -O0 -msse3 -D_GNU_SOURCE -I/usr/local/include  -c -o toad.o toad.c
cc -L/usr/local/lib  toad.o  -ldl -lresolv -o toad
$ ./toad
$ echo $?
0

mischa sandberg,

The above runs at about 10-15 times the speed of the glibc 4.4.5 strstr for any pattern less than 32 bytes.

This includes edge cases like the match occurring at offset 0 of the target string.

Note that glibc strlen (and hence glibc strstr) benefits from a strlen that in fact uses the basic idea above.

There's a speed drop to 2-3 times the speed of the glibc strstr at 32 bytes ... still chewing on why.

mischasan.wordpress.com for the things I've been finding.

mischa sandberg,

And, umm, scratch the -O0 in the above "cc" line if you want to see the 10-15 times speedup :-)

Peter Kankowski,
Also, I'm fascinated that assert("st" == sse2_strchr(...)) works

I'm sorry, it certainly should be strcmp. The two-char strstr is a great idea. I've subscribed to your blog.

Misaligned memory accesses on x86,

On x86 and x86_64 (on Core i7), there is no penalty for any misaligned memory access within a cache line, and for misaligned accesses crossing a cache line the penalty is one cycle. You might as well consider misaligned accesses completely free. AMD architectures also have negligible penalties for misalignment.

I think I know why: I was doing performance analysis once, and I enabled alignment checking in the EFLAGS register and set up my debugger to catch the alignment traps and record the stack traces. To my surprise, I was getting millions of misaligned accesses throughout Windows code and in the CRT. My theory is that Intel engineers did a similar investigation and found misaligned accesses to be happening so often that they realized they needed to make them super fast. In Core i7, the handling of misaligned accesses is absolutely amazing.

ace,

The first x86 processors from Intel with no penalty for some misaligned memory accesses that I know of were some Core 2 models on 45 nm. The ones on 65 nm were significantly slower.

Brad Conroy,

You could eliminate the len += 4 from the loop by using a local char pointer instead of s:

const char *buf = s;
...
... return buf-s;
... return buf-s+1;
... return buf-s+2;
... return buf-s+3;
buf+=4;
Peter Kankowski,

Brad, thank you very much for noticing. Your version has one instruction less in the loop. However, len += 4 can be executed in parallel with s += 4 on a superscalar processor. As far as I remember, both versions have the same speed for this reason.

Denis Novikov,

Hi Peter,

I've found that the standard strlen works faster than your implementation on Ubuntu Linux 12.04 x86_64 (g++ 4.6.3).

I disassembled `libc.so` and saw that the strlen function checks the CPU features and calls one of these functions: __strlen_sse2_pminub, __strlen_sse2, __strlen_sse42, or __strlen_sse2_no_bsf.

Denis Novikov,

Sometimes the standard `strlen` function works 8x faster (with the long string).

Peter Kankowski,

Hi Denis,

thank you for noticing. My friend Dmitry wrote a faster SSE2 implementation:

http://www.strchr.com/sse2_optimised_strlen

Denis Novikov,

Peter, I've seen the SSE2 implementation by Dmitry.

For the short string, your implementation is the fastest (1.4x faster than glibc's implementation and 1.5x faster than Dmitry's).

For the long string, your implementation is the slowest (7.5x slower than glibc's and 4.3x slower than Dmitry's).

So the fastest implementation is the `standard` glibc one.

Peter Kankowski,
One year ago, they added a fast implementation by Ondřej Bílka to glibc. I can only congratulate the glibc developers for making it faster than my code. :)
Denis Novikov,

Ubuntu 12.04 is `LTS` and contains a two-year-old version of `glibc`. It would be really interesting to test the newest version of `strlen`.

RHyde,

FWIW, here is an implementation that prevents crossing MMU pages using SSE instructions. String addresses are passed in RSI and RDI.

; strcmp2-
;
;  String comparison using pcmpistri instruction 
; and computing string lengths ahead of time.

strcmp2     proc

xmm0Save	textequ	<[rsp]>
xmm1Save	textequ	<[rsp+16]>

            push	rsi
            push    rdi
            push    rcx
            sub	rsp, 32
            movdqu	xmm0Save, xmm0
            movdqu	xmm1Save, xmm1

            ; Make RDI dependent on RSI so we can
            ; step through RDI by incrementing RSI:

            sub	rdi, rsi

            ; Okay, 16-byte align string 1 (pointed
            ; at by RSI):

paraAlign:	mov	al, [rsi]
	cmp	al, [rdi][rsi*1]
	jne	different

	; Move on to next character

	inc	rsi
		
	; Check for end of string:

	cmp	al, 0
	je	same
	; Now we need to see if we've aligned RSI on
	; a 16-byte boundary

	test	rsi, 0fh
	jnz	paraAlign
	
; RSI is now paragraph aligned. We can fetch blocks of
; 16 bytes at [RSI] without fear of a general protection
; fault. However, we don't know what RDI's alignment is,
; so we have to test to see if it's legal to fetch 16 bytes
; at a time from RDI (which is really [rdi][rsi*1] at this
; point).

	sub	  rsi, 16
scLoop:	add	  rsi, 16		;On to next block.
scLoop2:	lea	  rcx, [rdi][rsi*1]	;Check the src2
	and	  rcx, 0fffh		; block to see if
	cmp	  rcx, 0fh		; there are at		
	jbe	  lt16inPage		; least 16 bytes
					; left on MMU page.

; If we have at least 16 bytes left on the MMU page for the
; src2 block, then use pcmpistri to compare the 16 bytes
; at src1 (which we know is completely on an MMU page as
; RSI is 16-byte aligned) against the 16 bytes at src2
; (RDI+RSI). Load src1 bytes into XMM1 and use src2 as
; the pcmpistri operand (because we can use movdqa for
; src1, as it is aligned, and pcmpistri allows non-aligned
; accesses).
	
isAligned:	movdqa	  xmm1, [rsi]
	pcmpistri xmm1, [rsi][rdi*1], scFlags
	ja	  scLoop	;Equal, no zero bytes
	jc	  different2	;Not equal
	
; At this point, the zero flag must be set, so there
; must have been a zero byte in src1 or src2. As the
; characters also match, the strings must be equal.

same:	xor	rax, rax
	jmp	exit
	
; lt16inPage-
;
;  Code transfers to this label when there are fewer
; than 16 characters left in the src2 memory page.
; We must compare byte-by-byte until we hit a zero
; byte or cross the MMU page boundary.
;
; Note that if RCX is zero at this point, then
; RCX is already 16-byte aligned and we can jump
; right back up to the loop above. 

lt16inPage:	jrcxz	isAligned
cmpUpTo16:	mov	al, [rsi]
	cmp	al, [rdi][rsi*1]
	jne	different
	inc	rsi
	cmp	al, 0
	je	same
	dec	rcx
	jnz	cmpUpTo16
	jmp	paraAlign

; Branch to this point from the code where we were
; aligning RSI to a 16-byte boundary and found a
; different character (between RSI and RDI).

different2:	add	rsi, rcx	
different:	mov	al, [rsi]
	sub	al, [rdi][rsi*1]
	movsx	rax, al
exit:	movdqa	xmm0, xmm0Save
	movdqa	xmm1, xmm1Save
	add	rsp, 32
	pop     rcx
            pop     rdi
            pop     rsi
            ret
            
strcmp2     endp

Peter Kankowski,
Thank you, it's something I missed in the blog post.
Nicolas ,

First, check whether the start of your string is aligned to sizeof(unsigned).

If not, check the first few unaligned bytes one by one.

Then check the aligned unsigned words.

With this, there's no need to put an extra 3 bytes at the end, as you will never hit a page that you can't read, hence no crash.

And your code will be faster, as you will read aligned unsigned values.

Peter Kankowski,
Hi Nicolas, thank you so much. Yes, it's a good solution. The code will be a bit longer, but there's no need to put the extra 3 bytes at the end, as you said.
