Abstract:
Unrolling the loop in the strlen function.

Created by Peter Kankowski
Contributors: Ace, Vladimir Kraljevic, and Madis Kalme
Filed under Low-level code optimization


Fast strlen function

Disclaimer: The strlen function rarely lies on the critical path of a program, and if it does, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.
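(For illustration, a length-prefixed string can be as simple as the following sketch; the type is hypothetical, not taken from any particular library.)

struct pstring {
    size_t len;   /* the "strlen" result, kept up to date by the string functions */
    char  *data;  /* len characters plus a terminating zero */
};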

Vectorized strlen

A typical implementation of the strlen function scans the string byte by byte, looking for the terminating zero. For example:

size_t strlen(const char *s) {
    const char *start = s;
    while(*s)
        s++;
    return s - start;
}

It's small and simple, but not very fast. Here is how we can improve it:

// for x86 only
size_t my_strlen(const char *s) {
    size_t len = 0;
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares each byte with zero. This is what Agner Fog calls "vector operations in general purpose registers".

Problems and solutions

Warning: this function will crash if a non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of the string.
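For example, a duplication helper could reserve the padding itself (a sketch; my_strdup_padded is a hypothetical name):

#include <stdlib.h>
#include <string.h>

/* Sketch: copy a string into a buffer with 3 extra zero bytes,
   so my_strlen can safely read whole dwords past the terminator. */
char *my_strdup_padded(const char *src) {
    size_t n = strlen(src) + 1;       /* including the terminator */
    char *p = (char*)malloc(n + 3);   /* 3 = sizeof(unsigned) - 1 */
    if (p != NULL) {
        memcpy(p, src, n);
        memset(p + n, 0, 3);          /* zero the padding bytes */
    }
    return p;
}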

The dwords may be unaligned, but the x86 architecture allows access to unaligned data. For short strings, aligning the pointer would take more time than the penalty for unaligned reads.

The code is not portable: you will have to add another four conditions on a 64-bit processor, and for big-endian architectures, the order of the conditions should be reversed.
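A 64-bit little-endian variant might look like this (an untested sketch along the same lines; note that it needs 7 bytes of padding after the string instead of 3):

#include <stdint.h>

// for x86-64, little-endian only
size_t my_strlen64(const char *s) {
    size_t len = 0;
    for(;;) {
        uint64_t x = *(const uint64_t*)s;
        if((x & 0xFFULL) == 0) return len;
        if((x & 0xFF00ULL) == 0) return len + 1;
        if((x & 0xFF0000ULL) == 0) return len + 2;
        if((x & 0xFF000000ULL) == 0) return len + 3;
        if((x & 0xFF00000000ULL) == 0) return len + 4;
        if((x & 0xFF0000000000ULL) == 0) return len + 5;
        if((x & 0xFF000000000000ULL) == 0) return len + 6;
        if((x & 0xFF00000000000000ULL) == 0) return len + 7;
        s += 8, len += 8;
    }
}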

Agner Fog wrote another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.
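The heart of such tricks is a branch-free test for a zero byte in a whole word, along these lines (a sketch of the well-known expression quoted in the comments below, not Agner's exact code):

/* Nonzero iff the 32-bit word x contains a zero byte. */
#define HAS_NULLBYTE(x) (((x) - 0x01010101u) & ~(x) & 0x80808080u)

/* The scanning loop then needs only one branch per dword: */
while (!HAS_NULLBYTE(*(const unsigned*)s))
    s += 4;
/* ...and the exact zero byte is located outside the loop. */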

Test conditions and results

The functions were tested on several short strings (the words of the Gettysburg Address) and on one long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for the measurements.

Function               Short strings   The long string
strlen by Microsoft         1471            2104
strlen_my                   1019            1471
strlen by Agner Fog         1279            1056

Agner Fog's function is faster for the long string, while strlen_my performs better on the short strings.

Download the test program (6 Kb)

Related articles

SSE2 optimised strlen by Dmitry Kostjuchenko

Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski

Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP.


68 comments

AC,
Measurements where inconvenient data are dropped are not good measurements. Statistics is important. In normal execution, you'll have to work with different data and different lengths, so everything counts. What you have to do to get realistic measurements is to simulate realistic conditions. That would typically mean generating a) strings of different lengths and b) enough input data that not everything always fits in the cache.
The easy way to achieve a) in a repeatable fashion (so that I can download your program, run it, for example, on my P4, and have the same input data as you) is to use a random-number generator, seeding it with a known seed before you start generating data. Then you generate a lot of data of some given average length (for example, 50 MB of data) and then measure with (you guessed it) QueryPerformanceCounter. Such tests are repeatable on different processors and can show you surprising differences between processors -- what's very fast on one can be very bad on another (as a small example, code with more function calls in the critical loop is much faster on a P-M than on a P4). Moreover, the tests lie less compared to real use, since you don't always test with so little data that everything happens only inside the cache or inside the processor.
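For example, the generator could be as simple as this sketch (the seed value and the names are illustrative):

#include <stdlib.h>

/* Sketch: repeatable test data -- a fixed seed gives the same
   strings on every machine that runs the benchmark. */
void gen_strings(char *buf, size_t total) {
    size_t i = 0;
    srand(12345);                        /* known seed => repeatable */
    while (i + 101 <= total) {
        size_t len = 1 + rand() % 99;    /* lengths 1..99, average ~50 */
        for (size_t j = 0; j < len; j++)
            buf[i + j] = (char)(1 + rand() % 255);  /* non-zero bytes */
        buf[i + len] = '\0';
        i += len + 1;
    }
}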
Peter Kankowski,
There are two types of profiling: the realistic measurements you are talking about and also synthetic measurements. Both types are useful. Realistic measurements can help if you have a real program, so you can locate its hotspots, optimize them, and see whether the overall performance improves. If the test shows a slowdown, you can try another approach to optimization. And, yes, people use various generators and scripts to simulate the user's data and actions in such tests. Time can be measured with QueryPerformanceCounter or even GetTickCount, because it's typically several seconds or so.

But the whole example with strlen is very unrealistic. I can't imagine a program in which strlen would be a hotspot (i.e., a piece of code where the program spends most of its execution time). Though, you can construct a test to measure the duration of one or several strlen operations. That is what I call synthetic measurements. Agner Fog uses a similar technique to measure the duration of single instructions (such as ADD or MUL) for his instruction tables. You can also extract a short and frequently called function from your program and optimize it separately, using synthetic tests to measure your achievements.

Synthetic tests are usually repeated 10-20 times. The first runs show longer durations because of uncached data access or branch mispredictions. These effects are hard to reproduce; they are also insignificant, because if the function is called 4000 times, it doesn't matter how much time the first 5 runs take, but it does matter how the next 3995 runs do.

Sometimes, a task switch or an interrupt will occur during the test. It looks like this:

 strlen:       2168 ticks <-+ first run effects
 strlen:       2144 ticks   |
 strlen:       2117 ticks   |
 strlen:       2139 ticks <-+
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2107 ticks
 strlen:       2823 ticks <-+ a surge probably caused by task switch
 strlen:       2112 ticks <-+
 strlen:       2107 ticks
 strlen:       2107 ticks


This is hard to repeat, too, and you have to discard such outliers.

For synthetic tests, you should use RDTSC, because QueryPerformanceCounter is too imprecise for short time intervals. Agner Fog describes the method in optimizing_assembly.pdf, chapter 18, "Measuring performance".

And about caching: it's hard to imagine that someone will use strlen for very large strings that do not fit in the cache (and if someone does, they should definitely look for a better solution, e.g., storing the length of the string in a variable :). Strlen is typically used for short strings (a path name, a user name, etc.). Such data may be in the L1 cache, in the L2 cache, in RAM, or even in a swap file. It depends on the data usage pattern, and it's hard to reproduce if you don't have a real program where strlen is a hotspot. That's why I used synthetic tests here.

I will repeat this again: both kinds of measurements are useful if used properly. Realistic tests are not as precise as synthetic tests, but they show you the effect that your optimization has on the overall program performance, and it's the most important thing.
AC,
You're right. Agner Fog measures the speed of ADD or MUL with RDTSC and a small number of repetitions because this method is best suited to his goal.
Now, when you program something like a substitute for a library function, you usually have other goals -- like making it worth using instead of the existing library function on the common platforms, and not only on the processor in your own computer. If you are using it as a substitute for the lib function, you are interested in how it performs on average, not on specially tuned cases where branch mispredictions don't happen, because that's impossible in real usage -- if you don't have mispredictions, you probably don't need such a piece of code anyway.
And you're of course 100% right that storing the length with the string can be the right solution for cases where a program does too many strlens.
Still, when you make something, try to run some tests on different processors (like P4 and P-M), measuring real time (in seconds) to do something that looks close enough to the real job -- I expect you are definitely going to have some surprises. And for speed optimizations, what really counts is only how fast you can process a lot of data (and usually on more than one computer). A little can be processed by anything anyway.
Fast strlen « СоНоты,
[...] I stumbled upon what turned out to be yet another reinvention of the wheel: a fast strlen implementation, the strlen_my function from the smallcode blog, although an assembly analogue of this algorithm from 1998 by Paul Hsieh was also found. [...]
icestudent,
Exactly -- a reinvented wheel. The technique described is known in the Intel and AMD manuals as vectorization.

It's also unclear where the "Microsoft's strlen" was taken from, since even a debug build uses the CRT code from strlen.asm, whose contents differ only in the function signature (in VS 7.1 and 8).
Peter Kankowski,
IceStudent said that MS VC++ 2003/2005 has a strlen function that is very similar to mine (see strlen.asm file in CRT sources).

But, I must add, this optimized function will be used only if you set the /Oi- compiler option (Enable Intrinsics = No). By default, if you use /O2 (Maximize Speed) or /Ox (Full Optimization), the /Oi option is turned on, and you have to turn it off manually (add /Oi- to Project Properties > Command Line > Additional Options).

Or, just use my function :).
Chris Severance,
The code is no good because it has a bounds error that will occur when the end of the string falls just before an unallocated page boundary. You must not read beyond the end of the string, and you can't avoid that when reading more than 1 character at a time. It would only work if you could ask your callers to pad their strings to a multiple of sizeof(int) bytes, at which point it would be substantially simpler to code because you would only need to check every sizeof(int) bytes.

Optimizing str() functions is a waste of time. They are used because they are simple, not because they are efficient. If you want efficiency at any size, you take the length once and pass it among many mem() functions. Optimally, you never calculate the length at all, because your string supplier provides it.
Alex Davies,
Chris is right, the above code is dangerous to use. Unless your memory manager uses a few bytes after every allocation, there's no guarantee that you won't get a memory access violation.

The FastCode library for Delphi has highly optimized strlen functions that get around this - even with the 128 bit SSE2 versions. The code for these looks something like:

lea ecx, [eax+16]
test ecx, $ff0
jz @@NearPageEnd
@@WithinPage:
...
@@NearPageEnd:
mov edx, eax
@@Loop:
cmp byte ptr [eax], 0
je @@SetResult2
add eax, 1
test eax, 15
jnz @@Loop
jmp @@WithinPage

This takes advantage of the fact that pages are aligned on 4096-byte boundaries (by my understanding).
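In C, the same page-boundary test could be sketched like this (using uintptr_t from <stdint.h>):

/* Sketch: true when fewer than 16 bytes remain before the next
   4096-byte page boundary, matching the asm test above. */
int near_page_end = (((uintptr_t)s + 16) & 0xFF0) == 0;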
Peter Kankowski,
Alex, thanks for your comment.

The easiest way to prevent this error is to allocate 3 extra bytes for each string and initialize them to zeros (so a string will have 4 terminating zero bytes instead of one). Chris mentioned this, too: 4 == sizeof(int).

In this case, strcmp, strcpy, and other functions can also operate on DWORDs (or ints), so they will be fast and simple. No special processing for the remaining 1-3 bytes at the end of the string will be needed.
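For example, with both buffers padded that way, a dword-at-a-time copy needs no tail handling (an untested sketch using the usual zero-byte test):

/* Sketch: strcpy over dwords; assumes both strings are padded to a
   multiple of 4 bytes with zeros, as described above. */
void my_strcpy4(char *dst, const char *src) {
    for (;;) {
        unsigned x = *(const unsigned*)src;
        *(unsigned*)dst = x;             /* may copy up to 3 pad bytes */
        if ((x - 0x01010101u) & ~x & 0x80808080u)
            return;                      /* this dword held the zero byte */
        src += 4; dst += 4;
    }
}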

IMHO, it's stupid to use SSE2 to optimize strlen. If your string is so long, you should store its length somewhere and not use the loop over the string to determine its length. Strlen is basically an O(N) loop, no matter how much you optimize it.
ac,
A. Fog's "clever math tricks" for the first time appeared on the interent as the C code in the post Alan Mycroft wrote in Usent comp.arch on Apr 8, 1987 (yes some 20 years ago, as RISC architecture started to be used):

"You might be interested to know that such detection of null bytes in words can be done in 3 or 4 instructions on almost any hardware (nay even in C). (Code that follows relies on x being a 32 bit unsigned (or 2's complement int with overflow ignored)...)

   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080) 


Then if e is an expression without side effects (e.g. variable)

   has_nullbyte_(e) 


is nonzero iff the value of e has a null byte. (One can view this as explicit programming of the Manchester carry chain present in many processors which is hidden from the instruction level).
On Acorn's ARM risc machine using this feature as Brian Case suggests speeded up dhrystone by 20-25%"
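Dropped into the strlen loop from the article, the macro would be used something like this (my sketch, not Mycroft's code; x86, 32-bit):

size_t strlen_mycroft(const char *s) {
    const char *p = s;
    unsigned x;
    while (x = *(const unsigned*)p, !has_nullbyte_(x))
        p += 4;
    /* locate the exact zero byte among the four */
    if ((x & 0xFF) == 0)       return p - s;
    if ((x & 0xFF00) == 0)     return p - s + 1;
    if ((x & 0xFF0000) == 0)   return p - s + 2;
    return p - s + 3;
}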
Vladimir Kraljevic,
Hi,

I don't agree with all the claims posted.
First, your method of measuring the efficiency of the functions is affected by a few unaccounted-for factors:

1. When measuring something, we'll get more accurate results if the tests are run for a longer time
2. Context switches are unlikely to occur frequently under Windows when you are running in a realtime-priority thread
3. Your code assumes that it is executing on a single-core machine - you didn't set the affinity mask, so you could even read another core's time stamp counter (currently no CPU in existence has an invariant TSC)
4. The main optimization of your code, as you'll see, comes from proper data alignment and removing the unnecessary "len" variable
5. Another obstacle to measurement is the printf function in the middle of the critical loop, which can yield weird timings

So now, your function looks like this:


__declspec(align(16)) int __stdcall strlen_yours(const char *s)
{
    __declspec(align(16)) const char *r = s;
    for(;;)
    {
        __declspec(align(16)) unsigned x = *((unsigned *)s);
        if((x & 0x000000FF) == 0) return (int)(s - r);
        if((x & 0x0000FF00) == 0) return (int)((s - r) + 1);
        if((x & 0x00FF0000) == 0) return (int)((s - r) + 2);
        if((x & 0xFF000000) == 0) return (int)((s - r) + 3);
        s += 4;
    }
}


And this "traditional" assembly function like this:


__declspec(align(16)) int __stdcall strlen_traditional(const char* s)
{
    __declspec(align(16)) int result = 0;
    __asm
    {
        xor   eax, eax
        mov   edi, [s]
        mov   ecx, eax
        dec   ecx
        repne scasb
        not   ecx
        dec   ecx
        mov   eax, ecx
        mov   [result], eax
    }
    return result;
}


I tested these with the Microsoft C++ compiler and Intel C++ 10.0.027, and the results are... how to put it... weird?
Believe me, I'm also surprised by the results... It seems that Intel knows a little bit more about its own architecture.
Five times faster code execution, even on the AMD CPU, is a little bit too much, but I'm pretty sure that these results are correct.

I measured the number of iterations performed during a specified period of time, per function.
If you agree, this way you get more accurate results - how many times you can perform the same function in a limited time interval.
Contrary to some assumptions, the repXX prefixes are still in the game, and are obviously the fastest in the West :)
Perhaps on some other architecture this will not be the case, but let's collect some results first, so we can compare.

The following tests were performed on AMD x64 X2 5600+ CPU, with 1024KB of L2 cache per core.

You can download modified source code and the binaries here:
http://www.borghesia.com/code/peter.zip

Command line output:

C:\Work\Kanowski\Release>Kanowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 25038911 iterations over 12000 ms
Finished Peter's original - with 53662840 iterations over 12000 ms
Finished Peter's optimized - with 82889344 iterations over 12000 ms
Finished Agner Fog's assembly - with 110818331 iterations over 12000 ms
Finished Traditional assembly - with 133129466 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 16659936 iterations over 12000 ms
Finished Peter's original - with 47309270 iterations over 12000 ms
Finished Peter's optimized - with 77931448 iterations over 12000 ms
Finished Agner Fog's assembly - with 108467888 iterations over 12000 ms
Finished Traditional assembly - with 124982814 iterations over 12000 ms

C:\Work\Kanowski\Release>Kanowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 217500254 iterations over 12000 ms
Finished Peter's original - with 441657769 iterations over 12000 ms
Finished Peter's optimized - with 668833221 iterations over 12000 ms
Finished Agner Fog's assembly - with 701593142 iterations over 12000 ms
Finished Traditional assembly - with 724449336 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 228646631 iterations over 12000 ms
Finished Peter's original - with 454222319 iterations over 12000 ms
Finished Peter's optimized - with 679860555 iterations over 12000 ms
Finished Agner Fog's assembly - with 710564707 iterations over 12000 ms
Finished Traditional assembly - with 727093980 iterations over 12000 ms

Cheers,
Vladimir
Madis Kalme,
I took this one and ran it on my Core 2 Duo T7200 @ 2.00GHz with a 4MB shared L2 cache. The result differences aren't that impressive, because what the Core architecture did was decrease latencies and increase the performance of slow instructions, so bad code also performs better :)
I have yet to test it as a native 64-bit application (for now it ran in WOW64 mode).

D:\Documents and Settings\Administrator\Desktop\tests>Kanowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 14581667 iterations over 12000 ms
Finished Peter's original - with 34420379 iterations over 12000 ms
Finished Peter's optimized - with 53623397 iterations over 12000 ms
Finished Agner Fog's assembly - with 70715876 iterations over 12000 ms
Finished Traditional assembly - with 77313332 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 16095626 iterations over 12000 ms
Finished Peter's original - with 32171382 iterations over 12000 ms
Finished Peter's optimized - with 48249741 iterations over 12000 ms
Finished Agner Fog's assembly - with 69516110 iterations over 12000 ms
Finished Traditional assembly - with 73369120 iterations over 12000 ms

D:\Documents and Settings\Administrator\Desktop\tests>Kanowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0

strlen - short strings (x86):
Finished Default - with 46771442 iterations over 12000 ms
Finished Peter's original - with 93445100 iterations over 12000 ms
Finished Peter's optimized - with 140225969 iterations over 12000 ms
Finished Agner Fog's assembly - with 161522516 iterations over 12000 ms
Finished Traditional assembly - with 168330988 iterations over 12000 ms

strlen - long strings (x86):
Finished Default - with 46733163 iterations over 12000 ms
Finished Peter's original - with 93433140 iterations over 12000 ms
Finished Peter's optimized - with 140150691 iterations over 12000 ms
Finished Agner Fog's assembly - with 161667933 iterations over 12000 ms
Finished Traditional assembly - with 165528894 iterations over 12000 ms

Regards,
Madis

Update: The benchmarking code contained a bug, so these results are invalid. See the discussion below.
ac,
Sorry, Vladimir, but your program is so broken that it's not worth explaining, and your claims are not based on anything relevant. You have a lot to learn.
Vladimir Kraljevic,
Well, the only thing that is "broken" is that I had to convert it so that the Intel C++ compiler would swallow it (the GetRDTSC() function was initially __fastcall, which is why I have one mistake in the strlen_traditional() function), so the final version should have one instruction less:


__declspec(align(16)) int __stdcall strlen_traditional(const char* s)
{
    __declspec(align(16)) int result = 0;
    __asm
    {
        xor   eax, eax
        mov   edi, [s]
        mov   ecx, eax
        dec   ecx
        repne scasb
        not   ecx
        dec   ecx
        mov   [result], ecx
    }
    return result;
}


And second, I must apologise to Peter: his family name is Kankowski, not Kanowski. I'm sorry, Peter!

AC: And regarding your comment, you actually said nothing at all.

Besides, you don't have a single argument to back up a single word you said. Show us anything you wrote / optimized (and please explain, if the code is broken, how Peter's code as optimized by me runs faster - my proof is available). Secondly, strlen_traditional is named after the original MS-DOS code which used this construction, not after my function :)

Unless you show us your talents, please stay away from the discussion - or perhaps I should; maybe this is not the right place to share opinions. Peter, please moderate.

ac probably stands for (A)nonymous (C)oward?
Madis Kalme,
Okay, the results are in. As you can see, 64-bit gives strlen only a mild boost (1% - 2%), and there should be another version of Agner's algorithm under 64-bit with wider tests. I also removed the unnecessary move just before putting ecx back into memory (in the traditional asm variant).

strlen - short strings (x64):
Finished Default              - with   47180255 iterations over 12000 ms
Finished Peter's original     - with   94500072 iterations over 12000 ms
Finished Peter's optimized    - with  141820205 iterations over 12000 ms
Finished Agner Fog's assembly - with  161002095 iterations over 12000 ms
Finished Traditional assembly - with  167928843 iterations over 12000 ms

strlen - long strings (x64):
Finished Default              - with   47324770 iterations over 12000 ms
Finished Peter's original     - with   94649438 iterations over 12000 ms
Finished Peter's optimized    - with  141971792 iterations over 12000 ms
Finished Agner Fog's assembly - with  163672178 iterations over 12000 ms
Finished Traditional assembly - with  167573645 iterations over 12000 ms


Madis

Update: The benchmarking code contained a bug, so these results are invalid. See the discussion below.
Madis Kalme,
Final update (replaced with 64-bit assembly):

strlen - short strings (x64):
...
Finished Agner Fog's assembly - with 158704981 iterations over 12000 ms
...
strlen - long strings (x64):
...
Finished Agner Fog's assembly - with 169442729 iterations over 12000 ms
...
ace,
Vladimir inserted his "results" into his messages, but not how he got them; otherwise it would be obvious how wrong he is. This is the code with which he "measures" (taken from his link, without the "debug" parts):

dwEnd = ::GetTickCount() + SAMPLING_INTERVAL;
do
{
    ALIGN UINT64 start = GetRDTSC();
    testfunc(strlenfunc);
    ALIGN UINT64 time = GetRDTSC() - start;
    dwCounter += (UINT32)time;
    dwNow = ::GetTickCount();
    g_dwTotalIterations++;
}
while(dwNow < dwEnd);
printf("Finished %10u iters\n", g_dwTotalIterations );


Obviously, he doesn't measure "strlenfunc". Since he doesn't know what he measures, I'll leave it to other readers to explain. Please don't communicate with Vladimir directly until he gets humbler.
ace,
I took a second look at Vladimir's code: any method he'd test at the end would be the best. This is my last comment on Vladimir's "code" or his claims.
Madis Kalme,
Actually, you can hardly ever get the raw results, and even if you do, they won't matter, because you will almost always have to add the overhead of calling the function itself.

Because all the measurements were done in an identical environment, they can be compared to each other. Of course, you can't relate them to any results obtained from another OS/CPU/measuring-algorithm configuration.

Madis
ace,
Madis, did you compile his code? Why don't you reorder the lines in which he invokes tests?

Instead:
	TestStrLen(name_default, 
	TestStrLen(name_peter, 
	TestStrLen(name_peter_new, 
	TestStrLen(name_agner_fog,
	TestStrLen(name_trad, 


Try:

	TestStrLen(name_trad, 
	TestStrLen(name_default, 
	TestStrLen(name_agner_fog,
	TestStrLen(name_peter_new, 
	TestStrLen(name_peter, 


And inform us what you can conclude.
Madis Kalme,
:| How could I be that stupid - there's no delta - these are all absolute ticks.

(Bang my head x 3) me so blind :P

Anyway, on 64-bit, traditional ASM seems to be the slowest, and default~peter~optimized_peter are on the same level. I'm still wondering why Agner's is so much slower than any of these three :S

Sorry again,
Madis
Vladimir Kraljevic,
Now, that's the argument! :)

Sorry "ac" or "ace", you're right regarding the measurement, my mistake.

It would be so much easier if you swallowed your pride and next time simply showed right away what's wrong; that way we could make fast code faster.

But the results are still there: with the Intel C++ optimizing compiler, Peter's function is the winner, leaving Agner Fog's and the traditional implementation far behind (though we cannot compare i386 with native x64 code, so we have to wait for Agner's update), and my optimization makes it even more successful :)

Updated results (and code is on my site as well http://www.borghesia.com/code/peter.zip) looks like:


C:\Work\Kankowski\Release>Kankowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1

Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926

strlen - short strings (x86):
Finished Traditional assembly - with 22375081 iterations over 12000 ms
Finished Agner Fog's assembly - with 29198690 iterations over 12000 ms
Finished Peter's optimized - with 28557076 iterations over 12000 ms
Finished Peter's original - with 28534699 iterations over 12000 ms
Finished Default - with 24722537 iterations over 12000 ms

strlen - long strings (x86):
Finished Traditional assembly - with 16307313 iterations over 12000 ms
Finished Agner Fog's assembly - with 30159599 iterations over 12000 ms
Finished Peter's optimized - with 30248880 iterations over 12000 ms
Finished Peter's original - with 30273804 iterations over 12000 ms
Finished Default - with 16467480 iterations over 12000 ms

C:\Work\Kankowski\Release>Kankowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1

Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926

strlen - short strings (x86):
Finished Traditional assembly - with 22396916 iterations over 12000 ms
Finished Agner Fog's assembly - with 32896169 iterations over 12000 ms
Finished Peter's optimized - with 225898470 iterations over 12000 ms
Finished Peter's original - with 221394773 iterations over 12000 ms
Finished Default - with 221423927 iterations over 12000 ms

strlen - long strings (x86):
Finished Traditional assembly - with 16335173 iterations over 12000 ms
Finished Agner Fog's assembly - with 37920384 iterations over 12000 ms
Finished Peter's optimized - with 224262402 iterations over 12000 ms
Finished Peter's original - with 221292361 iterations over 12000 ms
Finished Default - with 222935383 iterations over 12000 ms
Vladimir Kraljevic,
It seems that the Intel C++ optimizing compiler does better dead-code elimination.

Regarding the suspiciously good Intel results, I've found out that MSVC++ doesn't properly handle dead code even when it is told to... Grmph #@%$!!!

These are the final results - Agner's function is much faster than any other, even with "optimizing compilers". Peter's function is much faster when completely eliminated :(

C:\Work\Kanowski\Release>Kankowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1

Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926

strlen - short strings (x86):
Finished Traditional assembly - with 22266431 iterations over 12000 ms
Finished Agner Fog's assembly - with 32086429 iterations over 12000 ms
Finished Peter's optimized - with 26472001 iterations over 12000 ms
Finished Peter's original - with 22154576 iterations over 12000 ms
Finished Default - with 27178874 iterations over 12000 ms

strlen - long strings (x86):
Finished Traditional assembly - with 16251249 iterations over 12000 ms
Finished Agner Fog's assembly - with 38119397 iterations over 12000 ms
Finished Peter's optimized - with 24963566 iterations over 12000 ms
Finished Peter's original - with 21281613 iterations over 12000 ms
Finished Default - with 18453774 iterations over 12000 ms
Vladimir Kraljevic,
The only conclusions I can draw from this whole thing are:

1. Why buy and use "optimizing compilers" when manually written assembly code for the simplest functions is still a few times faster?
What about slightly more complex functions?

2. Proper alignment is obviously crucial nowadays.

3. Contrary to some manuals, Agner Fog's function performs better. Watch out.

4. Peter's code should run fastest according to today's CPU manufacturers' "best practices" (search for them on developer.intel.com and developer.amd.com), disregarding the fact that it needs trailing bytes to prevent possible access violations. But it does not.

5. The bad news is that, by my best guess, instruction pairing (as you can see from Agner's function) has more impact on our code than we ever imagined. That leads us to a lose-lose situation where both hand-written assembly functions and compiled ones behave a few times slower if they don't obey these rules. On the other hand, these rules differ slightly by CPU manufacturer and CPU type, which leads us into an even deeper abyss.
ace,
Just a note for anybody else reading this topic: I've checked Vladimir's latest code, and he still doesn't have a clue what he actually measures. Any numbers he publishes are still not useful. It seems that Madis doesn't understand the critical issues either. But I don't intend to teach V. something he missed if he slept through his classes in school, especially after he has trolled this topic so much instead of trying to think and learn something. I'm not going to comment here anymore until he publishes really correct code. The current one is still wrong for measuring what he believes he measures.
Peter Kankowski,
Vladimir, thank you for your contribution. Your optimized function is faster on my tests, too.

As AC noted, your benchmarking code has some problems:

> When measuring something we'll get more accurate results if tests are run for longer time
That's not always true. You can make very precise measurements with RDTSC, so you do not need to repeat the tests for short and fast functions (see Agner Fog's manual - optimizing_assembly.pdf, chapter "Measuring performance").

> __declspec(align(16))
You do not need to align every function; the compiler does it automatically (look at your executable under a disassembler). The excessive alignment of variables does not provide any benefit (test it with and without alignment and see the difference).

> Your code assumes that it is executing on single core machine
Yes, your code is better in this respect.

rdtsc
mov [result_hi], edx
mov [result_lo], eax


By the x86 calling conventions, 64-bit values are returned in EDX:EAX, so you do not need to copy them to memory and back. Again: look at your code under a disassembler or use the listing files generated by the compiler.

BTW, MSVC++ has an intrinsic function, __rdtsc, which you can use (look for "__rdtsc" in MSDN).
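For example (a minimal sketch):

#include <intrin.h>   /* MSVC: declares __rdtsc */

unsigned __int64 start = __rdtsc();
/* ... code being measured ... */
unsigned __int64 ticks = __rdtsc() - start;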

What is the purpose of the dwCounter variable and the GetRDTSC() measurements in your code?

Finally, you call GetTickCount() on each iteration of the inner loop. Win32 API calls are "heavy"; such a call may take more time than the strlen() function itself. A better way to measure throughput:

// pseudocode, not tested
UINT REPEAT = 1000;
UINT64 start = GetRDTSC();
for (UINT i = 0; i < REPEAT; i++)
    tested_func();
UINT64 time = GetRDTSC() - start;
UINT64 throughput = ((UINT64)REPEAT * TSC_frequency) / time;


Well, Vladimir, your benchmarks are not perfect (neither are mine), but you have a lot of interesting ideas. Feel free to report them.


Agner Fog's code is faster than your optimized version on Core 2, but slower on Pentium M (in both your benchmarks and mine). It may be due to the longer pipeline in Core 2 or to some other microarchitectural detail. It would be interesting to see the results on a Pentium 4. Does anybody here have a P4?

Madis, thank you very much for results on Core 2 and 64-bit mode.
ace-a-bad-cop,
I can run something on P4 but not before Sunday (I'm travelling now) and only if I like the code. ;>
I still don't know which of Vladimir's ideas are relevant except for "you can use end - begin instead of maintaining a length". Almost everything else he got plain dead wrong.
Madis Kalme,
I would prefer if the administrator edited my first 3 misleading posts. I hope there could be some message persuading the reader to continue to the next messages. That is all.

Thank You!
Madis
Peter Kankowski,
Madis, I inserted a notice about invalid results.
smallcode &raquo; Blog Archive &raquo; Using sentinel in string manipulation,
Another method to speed up string manipulation
Brian Hurt,

You missed an obvious optimization. You can detect a null in a four-byte word like this:

    unsigned y = x & 0x7F7F7F7Fu;
    y += 0x7F7F7F7F;
    y |= x;
    if ((y & 0x80808080) != 0x80808080u) {
        /* x has a nul- check each byte */
    } else {
        /* x does not have a null */
    }

The advantage of this approach is fewer jumps; it uses only simple (1-clock) instructions and 2 registers, allowing you to unroll this loop 2 or 3 times and give the execution units plenty to do.
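An unrolled scan using this test might look like the following sketch (x86, 32-bit; it assumes the buffer is padded so that reading two dwords past the terminator is safe):

for (;;) {
    unsigned x0 = ((const unsigned*)s)[0];
    unsigned x1 = ((const unsigned*)s)[1];
    unsigned y0 = ((x0 & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | x0;
    unsigned y1 = ((x1 & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | x1;
    if ((y0 & 0x80808080u) != 0x80808080u) break;            /* nul in x0 */
    if ((y1 & 0x80808080u) != 0x80808080u) { s += 4; break; } /* nul in x1 */
    s += 8;
}
/* the dword at s contains the nul - check each of its bytes */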

Paolo Bonzini,

Regarding invalid accesses, one possibility is this (sorry for the goto, so much easier to write it this way):

len = 0;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
if ((((intptr_t)p) & (sizeof(long)-1)) == 0) goto loop;
if (*p++ == 0) return len; else len++;
loop:
    for(;;) {
        unsigned x = *(unsigned*)p;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        p += 4, len += 4;
    }

By aligning the address at the beginning, you ensure the code will never perform memory accesses that span two pages.

Peter Kankowski,

Thank you, it's a nice idea! I would rewrite it with a fall-through switch (similar to Duff's device) to avoid the excess conditions:

int strlen_my(const char *s) {
    const char *p = s;
    /* handle the 1-3 unaligned leading bytes; (intptr_t)p & 3 == 1
       means three bytes remain until alignment, and so on */
    switch ( (intptr_t)p & 3 ) {
        case 1:
            if (*p == 0)
               return p - s;
            p++;
        case 2:
            if (*p == 0)
               return p - s;
            p++;
        case 3:
            if (*p == 0)
               return p - s;
            p++;
    }
 
    for(;;) {
        unsigned x = *(unsigned*)p;
        if((x & 0xFF) == 0) return p - s;
        if((x & 0xFF00) == 0) return p - s + 1;
        if((x & 0xFF0000) == 0) return p - s + 2;
        if((x & 0xFF000000) == 0) return p - s + 3;
        p += 4;
    }
}

The benchmark shows that this code is slower on short strings, but slightly faster on a long string because of aligned memory access. Thank you again for the idea.

Paolo Bonzini,

Can you find where the break-even point with Agner's code is now? This is probably as good as you can get for reasonably sized strings; and it's safer than your original code, whose unsafety made the comparison apples-to-oranges, basically.

(Note that you are trading up to 3 conditions for always 2 conditions – or no change at all if the compiler decides it's not worth using a decision tree implementation of the switch statement – so not much is changing regarding the number of jumps).

k06a,
s += 4; => ((unsigned*)s)++;
// Increment is more useful, I think.
// Increment is more usefull i think.
Dmitry Kostjuchenko,

I would propose aligning in a while loop to avoid the excessive conditions; this should outperform the fall-through switch as well, by +1.5% at least in my synthetic tests.

size_t __aux_strlen(const char *str)
{
    register size_t len = 0;
    // align pointer to 32 bits
    while ((((intptr_t)str) & (sizeof(unsigned int)-1)) != 0)
    {
        if (*str ++ == 0) 
            return len; 
        ++ len;
    }
    // search for 0
    for (;;)
    {
        unsigned int v = *(unsigned int *)str;
        if ((v & 0x000000FFU) == 0) return (len);
        if ((v & 0x0000FF00U) == 0) return (len + 1);
        if ((v & 0x00FF0000U) == 0) return (len + 2);
        if ((v & 0xFF000000U) == 0) return (len + 3);
        str += sizeof(unsigned int);
        len += sizeof(unsigned int);
    }
}
Peter Kankowski,

Thank you very much, Dmitry. Your function is also 29 bytes shorter when compiled with MSVC++ 2005.

Spanky,

"Four bytes are examined at once." Really? How odd. This _looks_ like it's supposed to be C code, so let's apply C rules:

1) A byte has a _minimum_ size, defined by its required range, of some 8 bits, but might be 32 or more - and often is. 2) An unsigned has a minimum size, defined by its required range, of 16 bits. 3) A string's length can exceed the range of an int, so the code can produce negative results. A string of length -17? Neat trick. 4) If I recall correctly, identifiers with names of the form str* are reserved.

Thus, we have no idea how many bytes are being examined, but we do know that as far as C (or, for that matter, C++) is concerned, the code is only required to access 16 bits at a time, and it is not required to work at all if the data isn't aligned properly for access as an unsigned.

Rather than worrying about code that's "fast", how about worrying about code that's actually correct, first? If there's no requirement that the code actually functions correctly, I can make a much faster version for you:

int mystrlen(const char *s ) {

return 17;

}

You'd be hard pressed to beat it's performance, and unlike your version, it is at least expected to work.

ace,

To Spanky: Which compiler actually uses "char" of more than 8 bits? Do you know of any? I don't mean "wchar_t", I mean "char".

(On another note, I believe it would be possible to make a C compiler that would generate code for old machines which used 6 bits for char, had 24-bit words, and were bit-addressable. So I agree that using "char" and "unsigned" as presented here is not portable. Not to mention endianness. There should probably be a line in the article mentioning it's all for the x86 world.)

Peter Kankowski,

Spanky, thank you for the comments. I added the "for x86 only" line. A generalized version for any processor would be too complex (see SSE2 optimised strlen as an example). Compilers might have a different size of unsigned (usually from 16 to 64 bits) and a different endianness, so you would have to use #ifdefs for that.

3), 4) Thanks, I corrected this.

Konstantin Belochapka,

Another implementation of strlen() that uses the same approach; it outperforms the run-time function by 120% on larger strings.

#define  SIZEOF_SIZE_T     (sizeof(size_t))
#define  SIZEOF_SIZE_T_BITS     (8*(SIZEOF_SIZE_T))

size_t __fast_strlen(const char *str)
{
   register size_t len = 0;

   // align pointer to natural size_t alignment
   while ((((size_t)str) & (SIZEOF_SIZE_T - 1)) != 0) {
      if (*str++ == 0) {
         return len;
      }
      ++len;
   }

   size_t n = (size_t)0x7F7F7F7F7F7F7F7FLL; 

   // search for 0 - little endian
   for(;;) {
      size_t v = *(size_t *)str;
      if (v == 0) return (len);

      size_t y = (v & n) + n;     // convert each null byte into 0x80 and each non null byte into 0x00
      y = ~(y | v | n);
      if( y == 0) {   // no null bytes
         str += SIZEOF_SIZE_T;
         len += SIZEOF_SIZE_T;
         continue;
      }

      size_t  yy0  = y & 0xFFFFFFFFLL;
      if( yy0 == 0) {  yy0 = y >> (SIZEOF_SIZE_T_BITS/2); len += SIZEOF_SIZE_T/2; }

      size_t  yy1  = yy0 & 0xFFFFLL;
      if( yy1 == 0) {  yy1 = yy0 >> (SIZEOF_SIZE_T_BITS/4); len += SIZEOF_SIZE_T/4; }

      size_t  yy2  = yy1 & 0xFFLL;
      if( yy2 == 0) {  yy2 = yy1 >> (SIZEOF_SIZE_T_BITS/8); len += SIZEOF_SIZE_T/8; }

      return len;
   }

   return len;  // never gets here, make the compiler happy
}

If anyone needs it, I can paste a template that works with wchar_t and on big-endian machines.

Thanks, Peter, for your idea.

  
Konstantin Belochapka,

Forgot to mention: this is for a 64-bit compiler (sizeof(size_t) == 8).

Konstantin Belochapka,

Of course, nothing beats the SSE 4.2 PCMPxSTRx instructions. They finally introduced an instruction that can significantly speed up all the memxxx() and strxxx() functions.

Konstantin Belochapka,

For those who do not like resorting to assembler, here is a strlen() that uses a Microsoft C compiler intrinsic.

It outperforms the run-time library strlen() by 500 percent.

#define EQUAL_ANY	0x00    // 0000b
#define RANGES		0x04    // 0100b
#define EQUAL_EACH	0x08    // 1000b
#define EQUAL_ORDERED	(0x8+0x04)    // 1100b
#define NEGATIVE_POLARITY  0x10     // 010000b
#define BYTE_MASK	   0x40     // 1000000b


#include <nmmintrin.h>  // SSE4.2 intrinsics

size_t  sse4_strlen(const char *str)
{
   size_t len = 0;
   __m128i  z, c;

   z = _mm_setzero_si128();

   int idx;
   while(true) {

      c = _mm_loadu_si128( (const __m128i*)(str) ); 
      idx = _mm_cmpistri(z, c, EQUAL_EACH);   //returns index of the first byte that matches
      if(idx < sizeof(__m128i)) break;
      len += sizeof(__m128i);
      str += sizeof(__m128i); 
   }
   return len + idx;
}



 
Konstantin Belochapka,

Obviously, all the other functions of the strxxx() family can be implemented just as easily.

For example strchr()

#define EFLAG_CARRY     (0x0001LL<<0)
#define EFLAG_PARITY    (0x0001LL<<2)
#define EFLAG_ADJUST    (0x0001LL<<4)
#define EFLAG_ZERO      (0x0001LL<<6)
#define EFLAG_SIGN      (0x0001LL<<7)
#define EFLAG_OVERFLOW  (0x0001LL<<11)

#include <intrin.h>     // __readeflags (MSVC)
#include <nmmintrin.h>  // SSE4.2 intrinsics

size_t  sse4_strchr(const char *str, char ch)
{
   size_t len = 0;
   __m128i  z, c;
   __int64  eflags;

   z = _mm_set1_epi8(ch);  // convert character to a string of characters

   int idx;
   while(true) {

      c = _mm_loadu_si128( (const __m128i*)(str) ); 
      idx = _mm_cmpistri(z, c, EQUAL_EACH);   // returns index of a first matched byte or 16 otherwise
      eflags = __readeflags();
      if(idx < sizeof(__m128i)) break;
      if(eflags & EFLAG_ZERO) {   // Zero Flag is set if operand contains zero byte
         return (size_t(-1));
      }
      len += sizeof(__m128i);
      str += sizeof(__m128i); 
   }
   return len + idx;
}

Also, the performance increase in comparison with the run-time library function is very significant.

Peter Kankowski,
Konstantin, thank you very much for the code.
mischa sandberg,

You don't need SSE4 to get this speed-up; pcmpeq + movemask + bsf do the job quite well.

Here's an sse2 equivalent, with attention to using aligned loads. It's gcc 4.4.

Pardon my dense style and presumed #includes

#define xmload(p)   _mm_load_si128((__m128i const*)p)
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))

char const *sse2_strchr(char const *str, char ch)
{
    __m128i     x, zero = {}, p0 = _mm_set1_epi8(ch);
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of tgt:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = (xmatch(x,p0) >> f) & ~z & (z - 1)
        if (m) return tgt + ffs(m) - 1;
        if (z) return NULL;
        tgt += 16 - f;
    }

    // 16-byte aligned chunks of str:
    for (; !(z = xmatch(x = xmload(str), zero)); str += 16)
        if ((m = xmatch(x,p0)))
            return str + ffs(m) - 1;

    // Final 0..15 bytes of tgt:
    if ((m = xmatch(x,p0) & ~z & (z - 1)))
        return str + ffs(m) - 1;

    return NULL;
}

Peter Kankowski,

Thank you. Agner Fog uses the same approach for strlen in his asmlib. Unfortunately, even after small corrections your code fails the tests.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <emmintrin.h>

#define xmload(p)   _mm_load_si128((__m128i const*)p)
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
#ifdef __GNUC__
#define ctz(a) __builtin_ctz(a)
#elif defined(_MSC_VER)
#include <intrin.h>
inline unsigned long ctz(unsigned long x) {
	unsigned long index;
	_BitScanForward(&index, x);
	return index;
}
#endif

// by mischa sandberg
char const *sse2_strchr(char const *str, char ch)
{
    __m128i     x, zero = {}, p0 = _mm_set1_epi8(ch);
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of tgt:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = (xmatch(x,p0) >> f) & ~z & (z - 1);
        if (m) return str + ctz(m);
        if (z) return NULL;
        str += 16 - f;
    }

    // 16-byte aligned chunks of str:
    for (; !(z = xmatch(x = xmload(str), zero)); str += 16)
        if ((m = xmatch(x,p0)))
            return str + ctz(m);

    // Final 0..15 bytes of tgt:
    if ((m = xmatch(x,p0) & ~z & (z - 1)))
        return str + ctz(m);

    return NULL;
}


int main() {
    assert( strcmp( sse2_strchr("abc", 'a'), "abc" ) ==0 );
    assert( NULL == sse2_strchr("", 'a') );
    assert( strcmp( sse2_strchr("abc", 'b'), "bc" ) ==0 );
    assert( NULL == sse2_strchr("abc", 'z') );
    
    assert( "st" == sse2_strchr("abc def ghi jkl mno pqr st", 's') );
    assert( "mno pqr st" == sse2_strchr("abc def ghi jkl mno pqr st", 'm') );
    return 0;
}
(crashes on the first xmload with access violation; tested under MSVC++ and GCC)
mischa sandberg,

Gah! Sorry. To save space I compressed xmload/xmatch from static inline functions into #define's.

The fix is:

#define xmload(p)   _mm_load_si128((__m128i const*)(p))

mischa sandberg,

Also, I'm fascinated that

   assert("st" == sse2_strchr(...)) 

works, as opposed to

   assert(!strcmp("st", sse2_strchr(...)))

With the compile switches I use, that doesn't get past a compile error:

"comparison with string literal results in unspecified behavior"

mischa sandberg,

The above logic can be extended to a strstr that scans for a match on the first two chars of the pattern.

Sorry for not working out the MSVC details here.

$ cat toad.c
#include <string.h>
#include <stdint.h>
#include <emmintrin.h>
static inline int under(int x) { return ~x & (x - 1); }
#define ctz(m)      (ffs(m) - 1)
#define xmload(p)   _mm_load_si128((__m128i const*)(p))
#define xmatch(a,b) _mm_movemask_epi8(_mm_cmpeq_epi8(a, b))
static char const *scanchr2(char const *str, char const pat[2])
{
    __m128i     x, zero = {};
    __m128i     p0 = _mm_set1_epi8(pat[0]), p1 = _mm_set1_epi8(pat[1]);
    uint16_t    pair = *(uint16_t const*)pat;
    unsigned    z, m, f = 15 & (uintptr_t)str;

    // Initial unaligned chunk of str:
    if (f) {
        z = xmatch(x = xmload(str - f), zero) >> f;
        m = under(z) & ((xmatch(x,p0) & (xmatch(x,p1) >> 1)) >> f);
        if (m) return str + ctz(m);
        if (z) return NULL;
        str += 16 - f;
        if (*(uint16_t const*)(str - 1) == pair)
            return str - 1;
    }

    // 16-byte aligned chunks of str:
    while (!(z = xmatch(x = xmload(str), zero))) {
        m = xmatch(x,p0) & (xmatch(x,p1) >> 1);
        if (m) return str + ctz(m);
        str += 16;
        if (*(uint16_t const*)(str - 1) == pair)
            return str - 1;
    }

    // Final 0..15 bytes of str:
    m = under(z) & xmatch(x,p0) & (xmatch(x,p1) >> 1);
    return m ? str + ctz(m) : NULL;
}

static char const *sse2_strstr(char const *str, char const *pat)
{
    unsigned patlen = strlen(pat);
    if (patlen == 0) return str;
    if (patlen == 1) return strchr(str, *pat);
    while ((str = scanchr2(str, pat)))
        if (!memcmp(str+2, pat+2, patlen-2))
            return str;
    return NULL;
}

int main(void)
{
    return strcmp("world", sse2_strstr("hello, world", "world"));
}

$ gmake toad
cc -g -MMD -O9 -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Winline -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -O0 -msse3 -D_GNU_SOURCE -I/usr/local/include  -c -o toad.o toad.c
cc -L/usr/local/lib  toad.o  -ldl -lresolv -o toad
$ ./toad
$ echo $?
0

mischa sandberg,

The above runs at about 10-15 times the speed of the glibc 4.4.5 strstr for any pattern less than 32 bytes.

This includes edge cases like the match occurring at offset 0 of the target string.

Note that glibc strlen (and hence glibc strstr) benefits from a strlen that in fact uses the basic idea above.

There's a speed drop to 2-3 times the speed of the glibc strstr at 32 bytes ... still chewing on why.

mischasan.wordpress.com for the things I've been finding.

mischa sandberg,

And, umm, scratch the -O0 in the above "cc" line if you want to see the 10-15 times speedup :-)

Peter Kankowski,
Also, I'm fascinated that assert("st" == sse2_strchr(...)) works

I'm sorry, it certainly should be strcmp. The two-char strstr is a great idea. I've subscribed to your blog.

Misaligned memory accesses on x86,

On x86 and x86_64 (on Core i7), there is no penalty for any misaligned memory access within a cache line, and for misaligned accesses crossing a cache line the penalty is one cycle. You might as well consider misaligned accesses completely free. AMD architectures also have negligible penalties for misalignment.

I think I know why: I was doing performance analysis once, and I enabled alignment checking in the EFLAGS register and set up my debugger to catch the alignment traps and record the stack traces. To my surprise, I was getting millions of misaligned accesses throughout Windows code and in the CRT. My theory is that Intel engineers did a similar investigation and found misaligned accesses to be happening so often that they realized they needed to make them super fast. In Core i7, the handling of misaligned accesses is absolutely amazing.

ace,

The first x86 processors from Intel with no penalty for some misaligned memory accesses that I know of were some Core 2 models on 45 nm. The ones on 65 nm were significantly slower.

Brad Conroy,

You could eliminate the len += 4 from the loop by using a local char pointer instead of s:

const char *buf = s;
...
... return buf-s;
... return buf-s+1;
... return buf-s+2;
... return buf-s+3;
buf+=4;
Peter Kankowski,

Brad, thank you very much for noticing. Your version has one instruction less in the loop. However, len += 4 can be executed in parallel with s += 4 on a superscalar processor. As far as I remember, both versions have the same speed for this reason.

Denis Novikov,

Hi Peter,

I've found that the standard strlen works faster than your implementation on Ubuntu Linux 12.04 x86_64 (g++ 4.6.3).

I disassembled `libc.so` and saw that the strlen function checks the CPU features and calls one of these functions: __strlen_sse2_pminub, __strlen_sse2, __strlen_sse42, or __strlen_sse2_no_bsf.

Denis Novikov,

Sometimes the standard `strlen` function works 8x faster (with the long string).

Peter Kankowski,

Hi Denis,

thank you for noticing. My friend Dmitry wrote a faster SSE2 implementation:

http://www.strchr.com/sse2_optimised_strlen

Denis Novikov,

Peter, I've seen the SSE2 implementation by Dmitry.

For the short string, your implementation is the fastest (1.4x faster than glibc's implementation and 1.5x faster than Dmitry's).

For the long string, your implementation is the slowest (7.5x slower than glibc's and 4.3x slower than Dmitry's).

So the fastest implementation is the `standard` glibc one.

Peter Kankowski,
One year ago, they added a fast implementation by Ondřej Bílka to glibc. I can only congratulate the glibc developers for making it faster than my code. :)
Denis Novikov,

Ubuntu 12.04 is `LTS` and contains a two-year-old version of `glibc`. It would be really interesting to test the newest version of `strlen`.

RHyde,

FWIW, here is an implementation that prevents crossing MMU pages using SSE instructions. String addresses are passed in RSI and RDI.

; strcmp2-
;
;  String comparison using pcmpistri instruction 
; and computing string lengths ahead of time.

strcmp2     proc

xmm0Save	textequ	<[rsp]>
xmm1Save	textequ	<[rsp+16]>

            push	rsi
            push    rdi
            push    rcx
            sub	rsp, 32
            movdqu	xmm0Save, xmm0
            movdqu	xmm1Save, xmm1

            ; Make RDI dependent on RSI so we can
            ; step through RDI by incrementing RSI:

            sub	rdi, rsi

            ; Okay, 16-byte align string 1 (pointed
            ; at by RSI):

paraAlign:	mov	al, [rsi]
	cmp	al, [rdi][rsi*1]
	jne	different

	; Move on to next character

	inc	rsi
		
	; Check for end of string:

	cmp	al, 0
	je	same
	; Now we need to see if we've aligned RSI on
	; a 16-byte boundary

	test	rsi, 0fh
	jnz	paraAlign
	
; RSI is now paragraph aligned. We can fetch blocks of
; 16 bytes at [RSI] without fear of a general protection
; fault. However, we don't know what RDI's alignment is,
; so we have to test to see if it's legal to fetch 16 bytes
; at a time from RDI (which is really [rdi][rsi*1] at this
; point).

	sub	  rsi, 16
scLoop:	add	  rsi, 16		;On to next block.
scLoop2:	lea	  rcx, [rdi][rsi*1]	;Check the src2
	and	  rcx, 0fffh		; block to see if
	cmp	  rcx, 0fh		; there are at		
	jbe	  lt16inPage		; least 16 bytes
					; left on MMU page.

; If we have at least 16 bytes left on the MMU page for the
; src2 block, then use pcmpistri to compare the 16 bytes
; at src1 (which we know is completely on an MMU page as
; RSI is 16-byte aligned) against the 16 bytes at src2
; (RDI+RSI). Load src1 bytes into XMM1 and use src2 as
; the pcmpistri operand (because we can use movdqa for
; src1, as it is aligned, and pcmpistri allows non-aligned
; accesses).
	
isAligned:	movdqa	  xmm1, [rsi]
	pcmpistri xmm1, [rsi][rdi*1], scFlags
	ja	  scLoop	;Equal, no zero bytes
	jc	  different2	;Not equal
	
; At this point, the zero flag must be set, so there
; must have been a zero byte in src1 or src2. As the
; characters also match, the strings must be equal.

same:	xor	rax, rax
	jmp	exit
	
; lt16inPage-
;
;  Code transfers to this label when there are fewer
; than 16 characters left in the src2 memory page.
; We must compare byte-by-byte until we hit a zero
; byte or cross the MMU page boundary.
;
; Note that if RCX is zero at this point, then
; RCX is already 16-byte aligned and we can jump
; right back up to the loop above. 

lt16inPage:	jrcxz	isAligned
cmpUpTo16:	mov	al, [rsi]
	cmp	al, [rdi][rsi*1]
	jne	different
	inc	rsi
	cmp	al, 0
	je	same
	dec	rcx
	jnz	cmpUpTo16
	jmp	paraAlign

; Branch to this point from the code where we were
; aligning RSI to a 16-byte boundary and found a
; different character (between RSI and RDI).

different2:	add	rsi, rcx	
different:	mov	al, [rsi]
	sub	al, [rdi][rsi*1]
	movsx	rax, al
exit:	movdqa	xmm0, xmm0Save
	movdqa	xmm1, xmm1Save
	add	rsp, 32
	pop     rcx
            pop     rdi
            pop     rsi
            ret
            
strcmp2     endp

Peter Kankowski,
Thank you, it's something I missed in the blog post.
Nicolas ,

First, check whether the start of your string is aligned to sizeof(unsigned).

If not, check the first few unaligned bytes one by one.

Then check the aligned unsigned words.

With this, there's no need to put an extra 3 bytes at the end, as you will never hit a page that you can't read, hence no crash.

And your code will be faster, as you will read aligned unsigned values.

Peter Kankowski,
Hi Nicolas, thank you so much. Yes, it's a good solution. The code will be a bit longer, but there's no need to put the extra 3 bytes at the end, as you said.
