Disclaimer: The strlen function rarely lies on the critical path of a program, and if it does, you should store the string length in an integer variable (Pascal-style strings). This article should be viewed as an exercise in code optimization, not as a recommendation for everyday programming.
Vectorized strlen
A typical implementation of the strlen function scans the string byte by byte, looking for the terminating zero. For example:
size_t strlen(const char *s) {
    const char *start = s;
    while (*s)
        s++;
    return s - start;
}
It's small and simple, but not very fast. Here is how we can improve it:
// for x86 only
size_t my_strlen(const char *s) {
    size_t len = 0;
    for (;;) {
        unsigned x = *(unsigned*)s;
        if ((x & 0xFF) == 0) return len;
        if ((x & 0xFF00) == 0) return len + 1;
        if ((x & 0xFF0000) == 0) return len + 2;
        if ((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}
Four bytes are examined at once. The program reads a double word from memory, extracts each of its bytes by ANDing with a mask, and compares the bytes with zero. That is what Agner Fog calls "vector operations in general purpose registers".
Problems and solutions
Warning: this function will crash if a non-readable memory page is located right after the end of the string. The simplest way to prevent this is to allocate 3 additional bytes at the end of the string.
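A minimal sketch of such padded allocation (the helper name alloc_padded_string is mine, for illustration only):

#include <stdlib.h>
#include <string.h>

/* Allocate a copy of src with 3 extra zero bytes after the terminator,
   so a 4-byte read at the last characters never touches an unmapped page. */
char *alloc_padded_string(const char *src) {
    size_t n = strlen(src);          /* any correct strlen will do here */
    char *p = malloc(n + 1 + 3);
    if (p) {
        memcpy(p, src, n + 1);       /* copy the string and its terminator */
        memset(p + n + 1, 0, 3);     /* the padding bytes */
    }
    return p;
}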
The dwords may be unaligned, but the x86 architecture allows access to unaligned data. For small strings, aligning the pointer would take more time than the penalty for unaligned reads.
The code is not portable: you will have to add another four conditions if you use a 64-bit processor. For big-endian architectures, the order of the conditions should be reversed.
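For illustration, a 64-bit little-endian variant might look like the sketch below (eight conditions instead of four; not from the article, and it has the same over-read caveat):

#include <stdint.h>
#include <stddef.h>

// x86-64, little-endian only; may read up to 7 bytes past the terminator
size_t my_strlen64(const char *s) {
    size_t len = 0;
    for (;;) {
        uint64_t x = *(const uint64_t *)s;
        if ((x & 0x00000000000000FFull) == 0) return len;
        if ((x & 0x000000000000FF00ull) == 0) return len + 1;
        if ((x & 0x0000000000FF0000ull) == 0) return len + 2;
        if ((x & 0x00000000FF000000ull) == 0) return len + 3;
        if ((x & 0x000000FF00000000ull) == 0) return len + 4;
        if ((x & 0x0000FF0000000000ull) == 0) return len + 5;
        if ((x & 0x00FF000000000000ull) == 0) return len + 6;
        if ((x & 0xFF00000000000000ull) == 0) return len + 7;
        s += 8, len += 8;
    }
}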
Agner Fog made another strlen function for his tutorials (file optimizing_assembly.pdf, chapter 13.5). The idea is the same, but he uses clever math tricks to avoid branches inside the loop.
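His exact code is in the PDF; the flavor of such a branch-free test is roughly the well-known subtraction trick below (a sketch of the idea, not Agner Fog's actual implementation):

#include <stddef.h>

/* (x - 0x01010101) & ~x & 0x80808080 is nonzero iff some byte of x is zero:
   subtracting 1 borrows into a byte's top bit only when that byte was 0. */
size_t strlen_branchfree(const char *s) {
    size_t len = 0;
    unsigned x;
    while ((((x = *(const unsigned *)(s + len)) - 0x01010101u)
            & ~x & 0x80808080u) == 0)
        len += 4;
    while (s[len])    /* find the exact byte inside the last dword */
        len++;
    return len;
}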
Test conditions and results
The functions were tested on several short strings (words from the Gettysburg Address) and on a long string (Lewis Carroll's Jabberwocky). The RDTSC instruction was used for the measurements.
| Function            | Short strings | The long string |
|---------------------|---------------|-----------------|
| strlen by Microsoft | 1471          | 2104            |
| my_strlen           | 1019          | 1471            |
| strlen by Agner Fog | 1279          | 1056            |
Agner Fog's function is faster for the long string, while my_strlen performs better on the short strings.
Download the test program (6 Kb)
Related articles
SSE2 optimised strlen by Dmitry Kostjuchenko
Implementing strcmp, strlen, and strstr using SSE 4.2 instructions by Peter Kankowski
68 comments
The easy way to achieve a) which is repeatable (in a way that I can download your program, run it, for example, on my P4, and have the same input data as you) is to use a random number generator, seeding it with a known seed before you start generating data. Then you generate a lot of data of some given average length (for example, 50 MB of data) and then measure with (you guessed it) GetPerformanceCounter. Such tests are repeatable on different processors and can show you surprising differences between processors -- what's very fast on one can be very bad on another (as a small example, code with more function calls in the critical loop is much faster on a P-M than on a P4). Moreover, such tests lie less compared to real use, since you are not always testing with so little data that everything happens inside the cache or inside the processor.
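A rough sketch of such a repeatable data generator (the sizes and names are mine, purely illustrative):

#include <stdlib.h>

#define TOTAL_BYTES (50u * 1024 * 1024)   /* ~50 MB of test strings */
#define AVG_LEN     50                    /* average string length  */

/* Fill one big buffer with consecutive NUL-terminated random strings.
   The fixed seed makes the data identical on every machine and run. */
static char *make_test_data(void) {
    char *buf = malloc(TOTAL_BYTES);
    if (!buf) return NULL;
    srand(12345);                          /* known seed => repeatable */
    size_t pos = 0;
    while (pos < TOTAL_BYTES) {
        size_t len = 1 + rand() % (2 * AVG_LEN);      /* mean ~AVG_LEN */
        if (pos + len + 1 > TOTAL_BYTES)
            len = TOTAL_BYTES - pos - 1;
        for (size_t i = 0; i < len; i++)
            buf[pos + i] = (char)('a' + rand() % 26); /* never 0 inside */
        buf[pos + len] = '\0';
        pos += len + 1;
    }
    return buf;
}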
But the whole example with strlen is very unrealistic. I can't imagine a program in which strlen will be a hotspot (i.e., a piece of code where the program spends most of its execution time). Though, you can construct a test to measure the duration of one or several strlen operations. That is what I call synthetic measurements. Agner Fog uses a similar technique to measure the duration of single operations (such as ADD or MUL) for his instruction tables. You can also extract a short and frequently called function from your program and optimize it singly, using synthetic tests to measure your achievements.
Synthetic tests are usually repeated 10-20 times. The first runs show longer durations because of uncached data access or branch mispredictions. These effects are hard to reproduce; they are also insignificant, because if the function is called 4000 times, it doesn't matter how much time the first 5 runs take, but it does matter how the next 3995 runs perform.
Sometimes, a task switch or an interrupt will occur during the test. Such spikes are hard to repeat, too, and you have to disregard them.
For synthetic tests, you should use RDTSC, because GetPerformanceCounter is too imprecise for short time intervals. Agner Fog describes the method in optimizing_assembly.pdf, chapter 18 "Measuring performance".
And about caching: it's hard to imagine that someone will use strlen for very large strings that do not fit in the cache (if someone does this, he definitely should look for a better solution, e.g., storing the length of the string in a variable :). Strlen is typically used for short strings (a path name, a user name, etc.). Such data may be in the L1 cache, in the L2 cache, in RAM, or even in a swap file. It depends on the data usage pattern, and it's hard to reproduce if you don't have a real program where strlen is a hotspot. That's why I used synthetic tests here.
I will repeat this again: both kinds of measurements are useful if used properly. Realistic tests are not as precise as synthetic tests, but they show you the effect that your optimization has on the overall program performance, and it's the most important thing.
Now, when you program something like a substitute for a library function, you usually have other goals -- like making it worth using instead of the existing library function on the common platforms, not only on the processor in your own computer. If you are using it as a substitute for the library function, you are interested in how it performs on average, not on specially tuned cases where branch mispredictions don't happen, because that's impossible to have in real usage -- if you don't have mispredictions, you probably don't need such a piece of code anyway.
And you're of course 100% right that storing the length with the string can be the right solution for cases where a program calls strlen too often.
Still, when you make something, try to do some tests on different processors (like P4 and P-M), measuring real time (in seconds) to do something that looks close enough to the real job -- I expect you are definitely going to have some surprises. And for speed optimizations, what really counts is only how fast you can process a lot of data (and usually on more than one computer). A little can be processed by anything anyway.
It's also unclear where "Microsoft's strlen" was taken from, because even in a debug build the CRT code from strlen.asm is used, whose contents differ only in the function signature (in VS 7.1 and 8).
But, I must add, this optimized function will be used only if you set the /Oi- compiler option (Enable Intrinsics = No). By default, if you use /O2 (Maximize Speed) or /Ox (Full Optimization), the /Oi option will be turned on, and you have to turn it off manually (add /Oi- to Project Properties > Command Line > Additional Options).
Or, just use my function :).
Optimizing the str() functions is a waste of time. They are used because they are simple, not because they are efficient. If you want efficiency at any size, you take the length once and pass it among many mem() functions. Optimally, you never calculate the length at all, because your string supplier provides you with one.
The FastCode library for Delphi has highly optimized strlen functions that get around this - even with the 128 bit SSE2 versions. The code for these looks something like:
lea ecx, [eax+16]
test ecx, $ff0
jz @@NearPageEnd
@@WithinPage:
...
@@NearPageEnd:
mov edx, eax
@@Loop:
cmp byte ptr [eax], 0
je @@SetResult2
add eax, 1
test eax, 15
jnz @@Loop
jmp @@WithinPage
It takes advantage of the fact that pages are aligned on 4096-byte boundaries (as I understand it).
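In C, the same page-end test could be written roughly like this (4 KB pages assumed; my illustration of the idea, not FastCode's actual code):

#include <stdint.h>

/* A 16-byte read starting at p stays within one 4096-byte page
   as long as p is at least 16 bytes away from the end of the page. */
static int read16_stays_in_page(const char *p) {
    return ((uintptr_t)p & 0xFFF) <= 4096 - 16;
}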
The easiest way to prevent this error is to allocate 3 extra bytes for each string and initialize them to zeros (so a string will have 4 terminating zero bytes instead of one). Chris said this, too: 4 == sizeof(int).
In this case, strcmp, strcpy, and other functions can also operate on DWORDs (or ints), so they will be fast and simple. No special processing for the remaining 1-3 bytes at the end of the string will be needed.
IMHO, it's stupid to use SSE2 to optimize strlen. If your string is so long, you should store its length somewhere and not use the loop over the string to determine its length. Strlen is basically an O(N) loop, no matter how much you optimize it.
"You might be interested to know that such detection of null bytes in words can be done in 3 or 4 instructions on almost any hardware (nay even in C). (Code that follows relies on x being a 32 bit unsigned (or 2's complement int with overflow ignored)...)
Then if e is an expression without side effects (e.g. variable)
is nonzero iff the value of e has a null byte. (One can view this as explicit programming of the Manchester carry chain present in many processors which is hidden from the instruction level).
On Acorn's ARM risc machine using this feature as Brian Case suggests speeded up dhrystone by 20-25%"
I don't agree with all the claims posted.
First, your methods of measuring the efficiency of the functions are skewed by a few unaccounted-for factors:
1. When measuring something, we'll get more accurate results if the tests are run for a longer time
2. Context switches are unlikely to occur frequently under Windows when you are running in a realtime-priority thread
3. Your code assumes that it is executing on a single-core machine - you didn't set the affinity mask, so you could even read another core's time stamp counter (currently no CPU in existence has an invariant TSC)
4. The main optimization of your code, as you'll see, comes from proper data alignment and from removing the unnecessary "len" variable
5. Another obstacle to measurement is the printf call in the middle of the critical loop, which can yield weird timings
So now, your function looks like this:
__declspec(align(16)) int __stdcall strlen_yours(const char *s)
{
__declspec(align(16)) const char *r = s;
for(;;)
{
__declspec(align(16)) unsigned x = *((unsigned *)s);
if((x & 0x000000FF) == 0) return (int)(s - r);
if((x & 0x0000FF00) == 0) return (int)((s - r) + 1);
if((x & 0x00FF0000) == 0) return (int)((s - r) + 2);
if((x & 0xFF000000) == 0) return (int)((s - r) + 3);
s += 4;
}
}
And this "traditional" assembly function like this:
__declspec(align(16)) int __stdcall strlen_traditional(const char* s)
{
__declspec(align(16)) int result = 0;
__asm
{
xor eax, eax
mov edi, [s]
mov ecx, eax
dec ecx
repne scasb
not ecx
dec ecx
mov eax, ecx
mov [result], eax
}
return result;
}
I tested these with the Microsoft C++ compiler and Intel C++ 10.0.027; the results are... how to put it... weird?
Believe me, I'm also surprised by the results... It seems that Intel knows a little bit more about its own architecture.
Five times faster code execution, even on the AMD CPU is a little bit too much, but I'm pretty sure that these results are correct.
I measured the number of iterations performed during a specified period of time, per function.
If you agree, this way you get more accurate results - how many times you can perform the same function in a limited time interval.
Contrary to some assumptions, repXX prefixes are still in the game, and obviously are fastest in the West :)
Perhaps on some other architecture, it will not be the case, but let's collect some results first, so we can compare.
The following tests were performed on AMD x64 X2 5600+ CPU, with 1024KB of L2 cache per core.
You can download modified source code and the binaries here:
http://www.borghesia.com/code/peter.zip
Command line output:
C:\Work\Kanowski\Release>Kanowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0
strlen - short strings (x86):
Finished Default - with 25038911 iterations over 12000 ms
Finished Peter's original - with 53662840 iterations over 12000 ms
Finished Peter's optimized - with 82889344 iterations over 12000 ms
Finished Agner Fog's assembly - with 110818331 iterations over 12000 ms
Finished Traditional assembly - with 133129466 iterations over 12000 ms
strlen - long strings (x86):
Finished Default - with 16659936 iterations over 12000 ms
Finished Peter's original - with 47309270 iterations over 12000 ms
Finished Peter's optimized - with 77931448 iterations over 12000 ms
Finished Agner Fog's assembly - with 108467888 iterations over 12000 ms
Finished Traditional assembly - with 124982814 iterations over 12000 ms
C:\Work\Kanowski\Release>Kanowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0
strlen - short strings (x86):
Finished Default - with 217500254 iterations over 12000 ms
Finished Peter's original - with 441657769 iterations over 12000 ms
Finished Peter's optimized - with 668833221 iterations over 12000 ms
Finished Agner Fog's assembly - with 701593142 iterations over 12000 ms
Finished Traditional assembly - with 724449336 iterations over 12000 ms
strlen - long strings (x86):
Finished Default - with 228646631 iterations over 12000 ms
Finished Peter's original - with 454222319 iterations over 12000 ms
Finished Peter's optimized - with 679860555 iterations over 12000 ms
Finished Agner Fog's assembly - with 710564707 iterations over 12000 ms
Finished Traditional assembly - with 727093980 iterations over 12000 ms
Cheers,
Vladimir
I have yet to test it as a native 64-bit application (now it ran in WOW64-mode).
D:\Documents and Settings\Administrator\Desktop\tests>Kanowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0
strlen - short strings (x86):
Finished Default - with 14581667 iterations over 12000 ms
Finished Peter's original - with 34420379 iterations over 12000 ms
Finished Peter's optimized - with 53623397 iterations over 12000 ms
Finished Agner Fog's assembly - with 70715876 iterations over 12000 ms
Finished Traditional assembly - with 77313332 iterations over 12000 ms
strlen - long strings (x86):
Finished Default - with 16095626 iterations over 12000 ms
Finished Peter's original - with 32171382 iterations over 12000 ms
Finished Peter's optimized - with 48249741 iterations over 12000 ms
Finished Agner Fog's assembly - with 69516110 iterations over 12000 ms
Finished Traditional assembly - with 73369120 iterations over 12000 ms
D:\Documents and Settings\Administrator\Desktop\tests>Kanowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x1
Env: Thread ideal processor 0
strlen - short strings (x86):
Finished Default - with 46771442 iterations over 12000 ms
Finished Peter's original - with 93445100 iterations over 12000 ms
Finished Peter's optimized - with 140225969 iterations over 12000 ms
Finished Agner Fog's assembly - with 161522516 iterations over 12000 ms
Finished Traditional assembly - with 168330988 iterations over 12000 ms
strlen - long strings (x86):
Finished Default - with 46733163 iterations over 12000 ms
Finished Peter's original - with 93433140 iterations over 12000 ms
Finished Peter's optimized - with 140150691 iterations over 12000 ms
Finished Agner Fog's assembly - with 161667933 iterations over 12000 ms
Finished Traditional assembly - with 165528894 iterations over 12000 ms
Regards,
Madis
Update: The benchmarking code contained a bug, so these results are invalid. See the discussion below.
__declspec(align(16)) int __stdcall strlen_traditional(const char* s)
{
__declspec(align(16)) int result = 0;
__asm
{
xor eax, eax
mov edi, [s]
mov ecx, eax
dec ecx
repne scasb
not ecx
dec ecx
mov [result], ecx
}
return result;
}
And the second thing: I must apologise to Peter, his family name is Kankowski, not Kanowski. I'm sorry, Peter!
AC: And regarding your comment, you actually said nothing at all.
Besides, you don't have a single argument to back a single word you said. Show us anything you wrote / optimized (and please explain, if the code is broken, how Peter's code optimized by me runs faster? - my proof is available). Secondly, strlen_Traditional is named after the original MSDOS code which used this construction, not after my function :)
Unless you show us your talents, please stay away from the discussion, or perhaps I should - maybe this is not the right place to share opinions; Peter, please moderate.
ac probably stands for (A)nonymous (C)oward?
Madis
Update: The benchmarking code contained a bug, so these results are invalid. See the discussion below.
strlen - short strings (x64):
...
Finished Agner Fog's assembly - with 158704981 iterations over 12000 ms
...
strlen - long strings (x64):
...
Finished Agner Fog's assembly - with 169442729 iterations over 12000 ms
...
Obviously, he doesn't measure "strlenfunc". Since he doesn't know what he measures, I'll leave it to other readers to explain. Please don't communicate with Vladimir directly until he gets humbler.
Because all the measurements were done in an identical environment, they can be compared to each other. Of course, you can't relate them to results obtained with a different OS/CPU/measuring-algorithm configuration.
Madis
Instead:
Try:
And inform us what you can conclude.
(Bang my head x 3) me so blind :P
Anyway, on 64-bit, traditional ASM seems to be the slowest and default~peter~optimized_peter are on the same level. I'm still wondering why Agner's is so much slower than any of these three :S
Sorry again,
Madis
Sorry "ac" or "ace", you're right regarding the measurement, my mistake.
It would be so much easier if you swallowed your pride and next time simply showed immediately what's wrong; that way we could make fast code faster.
But the results are still there: with the Intel C++ optimizing compiler, Peter's function is the winner, leaving Agner Fog's and the traditional implementation far behind (but we cannot compare i386 with native x64 code, so we have to wait for Agner's update), and my optimization makes it even more successful :)
Updated results (and code is on my site as well http://www.borghesia.com/code/peter.zip) looks like:
C:\Work\Kankowski\Release>Kankowski_Microsoft.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1
Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926
strlen - short strings (x86):
Finished Traditional assembly - with 22375081 iterations over 12000 ms
Finished Agner Fog's assembly - with 29198690 iterations over 12000 ms
Finished Peter's optimized - with 28557076 iterations over 12000 ms
Finished Peter's original - with 28534699 iterations over 12000 ms
Finished Default - with 24722537 iterations over 12000 ms
strlen - long strings (x86):
Finished Traditional assembly - with 16307313 iterations over 12000 ms
Finished Agner Fog's assembly - with 30159599 iterations over 12000 ms
Finished Peter's optimized - with 30248880 iterations over 12000 ms
Finished Peter's original - with 30273804 iterations over 12000 ms
Finished Default - with 16467480 iterations over 12000 ms
C:\Work\Kankowski\Release>Kankowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1
Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926
strlen - short strings (x86):
Finished Traditional assembly - with 22396916 iterations over 12000 ms
Finished Agner Fog's assembly - with 32896169 iterations over 12000 ms
Finished Peter's optimized - with 225898470 iterations over 12000 ms
Finished Peter's original - with 221394773 iterations over 12000 ms
Finished Default - with 221423927 iterations over 12000 ms
strlen - long strings (x86):
Finished Traditional assembly - with 16335173 iterations over 12000 ms
Finished Agner Fog's assembly - with 37920384 iterations over 12000 ms
Finished Peter's optimized - with 224262402 iterations over 12000 ms
Finished Peter's original - with 221292361 iterations over 12000 ms
Finished Default - with 222935383 iterations over 12000 ms
Regarding the suspiciously good Intel results, I've found out that MSVC++ doesn't handle dead code properly even when it is told to... Grmph #@%$!!!
These are the final results - Agner's function is much faster than any other, even with "optimizing compilers". Peter's function is much faster only when it gets eliminated completely :(
C:\Work\Kanowski\Release>Kankowski_Intel.exe
Env: Process affinity mask succeeded
Env: PriorityBoost succeeded
Env: Thread affinity mask 0x2
Env: Thread ideal processor 0x1
Test1
Sanity check for Default yields 145
Sanity check for Peter's original yields 145
Sanity check for Peter's optimized yields 145
Sanity check for Agner Fog's assembly yields 145
Sanity check for Traditional assembly yields 145
Test2
Sanity check for Default yields 926
Sanity check for Peter's original yields 926
Sanity check for Peter's optimized yields 926
Sanity check for Agner Fog's assembly yields 926
Sanity check for Traditional assembly yields 926
strlen - short strings (x86):
Finished Traditional assembly - with 22266431 iterations over 12000 ms
Finished Agner Fog's assembly - with 32086429 iterations over 12000 ms
Finished Peter's optimized - with 26472001 iterations over 12000 ms
Finished Peter's original - with 22154576 iterations over 12000 ms
Finished Default - with 27178874 iterations over 12000 ms
strlen - long strings (x86):
Finished Traditional assembly - with 16251249 iterations over 12000 ms
Finished Agner Fog's assembly - with 38119397 iterations over 12000 ms
Finished Peter's optimized - with 24963566 iterations over 12000 ms
Finished Peter's original - with 21281613 iterations over 12000 ms
Finished Default - with 18453774 iterations over 12000 ms
1. Why buy and use "optimizing compilers" when manually written assembly code for the simplest functions is still a few times faster?
What about slightly more complex functions?
2. Proper alignment is obviously crucial nowadays.
3. Contrary to some manuals, Agner Fog's function performs better. Watch out.
4. Peter's code should run fastest according to todays CPU manufacturers' "best practices" (search for it on developer.intel.com and developer.amd.com), disregarding the fact that it needs trailing bytes in order to prevent possible access violations. But it is not.
5. The bad news is that, by my best guess, instruction pairing (as you can see from Agner's function) has more impact on our code than we ever imagined. That leads us to a lose-lose situation where both hand-written assembly functions and compiled ones behave a few times slower if they don't obey these rules. On the other hand, the rules differ slightly between CPU manufacturers and CPU types, which leads us into an even deeper abyss.
As AC noted, your benchmarking code has some problems:
> When measuring something we'll get more accurate results if tests are run for longer time
That's not always true. You can make very precise measurements with RDTSC, so you do not need to repeat the tests for short and fast functions (see Agner Fog's manual - optimizing_assembly.pdf, chapter "Measuring performance").
> __declspec(align(16))
You do not need to align every function, the compiler does it automatically (look at your executable under disassembler). The excessive alignment of variables does not provide you any benefits (test it with and without alignment and see the difference).
> Your code assumes that it is executing on single core machine
Yes, your code is better in this respect.
By x86 calling conventions, 64-bit values are returned in EDX:EAX, so you do not need to copy them to memory and back. Again: look at your code under disassembler or use listing files generated by compiler.
BTW, MSVC++ has an intrinsic function __rdtsc, which you can use (look for "__rdtsc" in MSDN).
What is the purpose of dwCounter variable and GetRDTSC() measurements in your code?
Finally, you call GetTickCount() on each iteration of the inner loop. Win32 API calls are "heavy"; such a call may take more time than the strlen() function itself. A better way to measure throughput:
// pseudocode, not tested
UINT REPEAT = 1000;
UINT64 start = GetRDTSC();
for(UINT i = 0; i < REPEAT; i++)
    tested_func();
UINT64 time = GetRDTSC() - start;
UINT64 throughput = ((UINT64)REPEAT * TSC_frequency) / time;
Well, Vladimir, your benchmarks are not perfect (neither are mine), but you have a lot of interesting ideas. Feel free to report them.
Agner Fog's code is faster than your optimized version on Core 2, but slower on Pentium M (on both your and my benchmarks). It may be due to longer pipeline in Core 2 or due to some other microarchitectural detail. It would be interesting to see the results on Pentium 4. Does anybody here have P4?
Madis, thank you very much for results on Core 2 and 64-bit mode.
I still don't know which of Vladimir's ideas are relevant, except for "you can use end - begin instead of maintaining length". Almost everything else he got plain dead wrong.
Thank You!
Madis
You missed an obvious optimization. You can detect a null in a four-byte word like:
The advantage of this approach is fewer jumps, and it uses only simple (1-clock) instructions and 2 registers, allowing you to unroll the loop 2 or 3 times and give the execution units plenty to do.
Regarding invalid accesses, one possibility is this (sorry for the goto, so much easier to write it this way):
By aligning the address at the beginning, you ensure the code will never perform memory accesses that span two pages.
Thank you, it's a nice idea! I would rewrite it with a fall-through switch (similar to Duff's device) to avoid the excess conditions:
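A sketch of such a fall-through switch, under the article's x86 little-endian assumptions (not necessarily the exact code this comment originally showed), could look like this:

#include <stdint.h>
#include <stddef.h>

size_t strlen_aligned_switch(const char *s) {
    const char *p = s;
    switch ((uintptr_t)p & 3) {   /* check 0-3 leading bytes, falling through */
    case 1: if (*p == 0) return p - s; p++;   /* fall through */
    case 2: if (*p == 0) return p - s; p++;   /* fall through */
    case 3: if (*p == 0) return p - s; p++;
    }
    for (;;) {   /* p is dword-aligned now; aligned reads cannot cross a page */
        unsigned x = *(const unsigned *)p;
        if ((x & 0x000000FF) == 0) return p - s;
        if ((x & 0x0000FF00) == 0) return p - s + 1;
        if ((x & 0x00FF0000) == 0) return p - s + 2;
        if ((x & 0xFF000000) == 0) return p - s + 3;
        p += 4;
    }
}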
The benchmark shows that this code is slower on short strings, but slightly faster on a long string because of aligned memory access. Thank you again for the idea.
Can you find where the break-even point is WRT Agner's code now? Probably this is as good as you can get for reasonably sized strings; and it is safer than your original code, which makes the comparison with your code apples-to-oranges, basically.
(Note that you are trading up to 3 conditions for always 2 conditions – or no change at all if the compiler decides it's not worth using a decision tree implementation of the switch statement – so not much is changing regarding the number of jumps).
I would propose aligning in a while{} loop to avoid the excessive conditions; this should outperform the fall-through switch as well, at least in my synthetic tests, by 1.5%.
Thank you very much, Dmitry. Your function is also 29 bytes shorter when compiled with MSVC++ 2005.
"Four bytes are examined at once." Really? How odd. This _looks_ like it's supposed to be C code, so let's apply C rules:
1) A byte has a _minimum_ size, defined by its required range, of some 8 bits, but might be 32 or more - and often is.
2) An unsigned has a minimum size, defined by its required range, of 16 bits.
3) A string's length can exceed the range of an int, so the code can produce negative results. A string of length -17? Neat trick.
4) If I recall correctly, identifiers with names of the form str* are reserved.
Thus, we have no idea how many bytes are being examined, but we do know that as far as C (or, for that matter, C++) is concerned, the code is only required to access 16 bits at a time, and it is not required to work at all if the data isn't aligned properly for access as an unsigned.
Rather than worrying about code that's "fast", how about worrying about code that's actually correct, first? If there's no requirement that the code actually functions correctly, I can make a much faster version for you:
int mystrlen(const char *s ) {
}
You'd be hard pressed to beat its performance, and unlike your version, it is at least expected to work.
To Spanky: Which compiler actually uses "char" of more than 8 bits? Do you know of any? I don't mean "wchar_t", I mean "char".
(On another side, I believe it would be possible to make a C compiler that would generate code for old machines which used 6 bits for char, had 24-bit words, and were bit-addressable. So I agree that using "char" and "unsigned" as presented here is not portable. Not to mention endianness. There should probably be a line in the article that mentions it's all for the x86 world.)
Spanky, thank you for the comments. I added the "for x86 only" line. Generalized code for any processor would be too complex (see SSE2 optimised strlen as an example). The compilers might have a different size of unsigned (usually from 16 to 64 bits) and a different endianness, so you would have to use #ifdefs for that.
3), 4) Thanks, I corrected this.
Another implementation of strlen() uses the same approach and outperforms the run-time function by 120% on larger strings.
If anyone needs it, I can paste a template that works with wchar_t and on big-endian machines.
Thanks, Peter, for your idea.
I forgot to mention: this is for a 64-bit compiler (size_t = 8).
Of course, nothing beats the SSE 4.2 PCMPxSTRx instructions. Finally they introduced instructions that can significantly speed up all the memxxx() and strxxx() functions.
For those who do not like to resort to assembler, here is a strlen() that uses a Microsoft C compiler intrinsic.
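A minimal sketch of a PCMPISTRI-based strlen using the compiler intrinsic (my reconstruction of the general idea, not necessarily the code this comment refers to):

#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stddef.h>

/* Caveat: the unaligned 16-byte load can read past the terminator and
   even cross into an unmapped page, just like the article's function. */
size_t strlen_sse42(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    size_t len = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + len));
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16)            /* 16 means: no zero byte in this chunk */
            return len + idx;
        len += 16;
    }
}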
It outperforms the run-time library strlen() by 500 percent.
Obviously, all the other functions in the strxxx() family can be implemented just as easily.
For example, strchr().
The performance increase over the run-time library function is also very significant.
You don't need SSE4 to get this speed-up; pcmpeq + movemask + bsf do the job quite well.
Here's an sse2 equivalent, with attention to using aligned loads. It's gcc 4.4.
Pardon my dense style and presumed #includes
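The general pcmpeq + movemask + ctz pattern with aligned loads looks roughly like the sketch below (mine, GCC-style; not the commenter's xmload/xmatch code):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse2(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    /* Round down to 16 bytes: an aligned load never crosses a page boundary. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= ~0u << ((uintptr_t)s & 15);   /* ignore bytes before the start */
    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }
    return (size_t)((p + __builtin_ctz(mask)) - s);
}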
Thank you. Agner Fog uses the same approach for strlen in his asmlib. Unfortunately, even after small corrections your code fails the tests.
(crashes on the first xmload with an access violation; tested under MSVC++ and GCC)

Gah! Sorry. To save space I compressed xmload/xmatch from static inline functions into #define's.
The fix is:
Also, I'm fascinated that
works, as opposed to
With the compile switches I use, that doesn't get past a compile error:
"comparison with string literal results in unspecified behavior"
The above logic can be extended to a strstr that scans for a match on the first two chars of the pattern.
Sorry for not working out the MSVC details here.
The above runs at about 10-15 times the speed of the glibc 4.4.5 strstr for any pattern less than 32 bytes.
This includes edge cases like the match occurring at offset 0 of the target string.
Note that glibc strlen (and hence glibc strstr) benefits from an implementation that in fact uses the same basic idea as above.
There's a speed drop to 2-3 times the speed of the glibc strstr at 32 bytes ... still chewing on why.
See mischasan.wordpress.com for the things I've been finding.
And, umm, scratch the -O0 in the above "cc" line if you want to see the 10-15 times speedup :-)
I'm sorry, it certainly should be strcmp. The two-char strstr is a great idea. I've subscribed to your blog.
On x86 and x86_64 (on Core i7), there is no penalty for any misaligned memory access within a cache line, and for misaligned accesses crossing a cache line, the penalty is one cycle. You might as well consider misaligned accesses completely free. AMD architectures also have negligible penalties for misalignment.
I think I know why: I was doing performance analysis once, and I enabled alignment checking in the EFLAGS register and set up my debugger to catch the alignment traps and record the stack traces. To my surprise, I was getting millions of misaligned accesses throughout the Windows code and in the CRT. My theory is that Intel engineers did a similar investigation and found misaligned accesses to be happening so often that they realized they needed to make them super fast. In Core i7, the handling of misaligned accesses is absolutely amazing.
The first Intel x86 processors with no penalty for some misaligned memory accesses that I know of were some Core 2 models on 45 nm. The ones on 65 nm were significantly slower.
You could eliminate the len += 4 from the loop by using a local char* instead of s.
Brad, thank you very much for noticing. Your version has one instruction fewer in the loop. However, len += 4 can be executed in parallel with s += 4 on a superscalar processor. As far as I remember, both versions have the same speed for this reason.
Hi Peter,
I've found that the standard strlen works faster than your implementation on Ubuntu Linux 12.04 x86_64 (g++ 4.6.3).
I disassembled `libc.so` and saw that the strlen function checks the CPU features and calls one of these functions: __strlen_sse2_pminub, __strlen_sse2, __strlen_sse42 or __strlen_sse2_no_bsf.
Sometimes the standard `strlen` function works 8x faster (with the long string).
Hi Denis,
thank you for noticing. My friend Dmitry wrote a faster SSE2 implementation:
http://www.strchr.com/sse2_optimised_strlen
Peter, I've seen the SSE2 implementation by Dmitry.
For the short string your implementation is the fastest (1.4x faster than the glibc implementation and 1.5x faster than Dmitry's one).
For the long string your implementation is the slowest (7.5x slower than glibc's and 4.3x slower than Dmitry's).
So the fastest implementation is the `standard` glibc one.
Ubuntu 12.04 is an `LTS` release and contains a two-year-old version of `glibc`. It would be really interesting to test the newest version of `strlen`.
FWIW, here is an implementation that prevents crossing MMU pages using SSE instructions. String addresses are passed in RSI and RDI.
First, check if the start of your string is aligned to sizeof(unsigned).
If not, check the first few unaligned bytes one by one.
Then check the aligned unsigned words.
With this, there is no need to put an extra 3 bytes at the end, because you will never hit a page that you can't read, hence no crash.
And your code will be faster because you will read aligned unsigned values, as in the sketch below.
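Putting those steps together, a sketch of that recipe (mine, x86 little-endian assumed) could be:

#include <stdint.h>
#include <stddef.h>

size_t strlen_align_first(const char *s) {
    const char *p = s;
    while ((uintptr_t)p % sizeof(unsigned) != 0) {   /* leading unaligned bytes */
        if (*p == 0)
            return (size_t)(p - s);
        p++;
    }
    for (;;) {                  /* aligned dword reads cannot cross a page */
        unsigned x = *(const unsigned *)p;
        if ((x & 0x000000FF) == 0) return (size_t)(p - s);
        if ((x & 0x0000FF00) == 0) return (size_t)(p - s) + 1;
        if ((x & 0x00FF0000) == 0) return (size_t)(p - s) + 2;
        if ((x & 0xFF000000) == 0) return (size_t)(p - s) + 3;
        p += 4;
    }
}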