Implementing strcmp and strlen using SSE 4.2 instructions
Using new Intel Core i7 instructions to speed up string manipulation.
The new instructions
SSE 4.2 introduces four instructions (PcmpEstrI, PcmpEstrM, PcmpIstrI, and PcmpIstrM) that can be used to speed up text processing code (including strcmp, memcmp, strstr, and strspn functions).
Intel has published a description of the new instruction formats, but neither sample code nor high-level guidelines. This article tries to provide them.
    MovDqU    xmm0, dqword[str1]
    PcmpIstrI xmm0, dqword[str2], imm
PcmpIstrI is one of the new string-handling instructions; each of them compares its two operands. The first operand is always an SSE register (typically loaded with MovDqU). The second operand can be a register or a memory location. The third, immediate operand consists of several bit fields that control the mode of operation.
Aggregation operations
The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).
Equal each (imm[3:2] = 10). This operation compares two strings (think of strcmp or memcmp). The result of comparison is a bit mask (1 if the corresponding bytes are equal, 0 if not equal). For example:
    operand1 = "UseFlatAssembler"
    operand2 = "UsingAnAssembler"
    IntRes1  =  1100000111111111
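The rule above can be modelled in portable C. This is only a scalar sketch of the aggregation for zero-terminated byte strings, not code that uses the instruction itself; the name `equal_each_mask` is hypothetical. It returns the mask as an integer in which bit 0 describes the first character, whereas the example above prints the first character leftmost. The treatment of bytes past the terminator (both ended: equal, one ended: different) is an assumption taken from the Intel description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK = 16 }; /* bytes in one XMM register */

/* Scalar model of Equal each for zero-terminated byte strings.
   Bit i of the result describes byte i (bit 0 = first character). */
uint16_t equal_each_mask(const char *op1, const char *op2)
{
    assert(op1 != NULL && op2 != NULL);
    uint16_t mask = 0;
    int end1 = 0, end2 = 0; /* set once a terminator has been seen */
    for (int i = 0; i < CHUNK; i++) {
        /* never read a string past its terminator */
        if (!end1 && op1[i] == '\0') end1 = 1;
        if (!end2 && op2[i] == '\0') end2 = 1;
        int bit;
        if (end1 && end2)      bit = 1; /* both strings ended: counts as equal */
        else if (end1 || end2) bit = 0; /* only one ended: counts as different */
        else                   bit = (op1[i] == op2[i]);
        mask |= (uint16_t)(bit << i);
    }
    return mask;
}
```

For the two strings of the example the function returns 0xFF83, which is 1100000111111111 when written with bit 0 first.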
Equal any (imm[3:2] = 00). The first operand is a character set, the second is a string (think of strspn or strcspn). The bit mask includes 1 if the character belongs to a set, 0 if not:
    operand1 = "aeiouy"
    operand2 = "You Drive Me Mad"
    IntRes1  =  0110001010010010
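A scalar C sketch of the same rule, for zero-terminated byte strings only. The name `equal_any_mask` is hypothetical, and bit 0 of the returned integer describes the first character of the string (the example above prints that bit leftmost). Bytes at and after the terminator are assumed to produce 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK = 16 }; /* bytes in one XMM register */

/* Scalar model of Equal any: bit i is set when str[i] occurs in set. */
uint16_t equal_any_mask(const char *set, const char *str)
{
    assert(set != NULL && str != NULL);
    uint16_t mask = 0;
    /* bytes at and after the terminator of str always give 0 */
    for (int i = 0; i < CHUNK && str[i] != '\0'; i++) {
        for (int j = 0; j < CHUNK && set[j] != '\0'; j++) {
            if (str[i] == set[j]) {
                mask |= (uint16_t)(1u << i);
                break;
            }
        }
    }
    return mask;
}
```

With the set "aeiouy" and the string "You Drive Me Mad" the result is 0x4946, that is, bits 1, 2, 6, 8, 11 and 14.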
Ranges (imm[3:2] = 01). The first operand consists of ranges, for example, "azAZ" means "all characters from a to z and all characters from A to Z":
    operand1 = "azAZ"
    operand2 = "I'm here because"
    IntRes1  =  1010111101111111
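The range test can be sketched in scalar C as follows. The name `ranges_mask` is hypothetical; the sketch assumes unsigned byte comparison and simply ignores an unpaired trailing byte in the range list. Bit 0 of the result describes the first character of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK = 16 }; /* bytes in one XMM register */

/* Scalar model of Ranges: ranges holds (low, high) byte pairs;
   bit i is set when str[i] falls inside at least one pair. */
uint16_t ranges_mask(const char *ranges, const char *str)
{
    assert(ranges != NULL && str != NULL);
    uint16_t mask = 0;
    for (int i = 0; i < CHUNK && str[i] != '\0'; i++) {
        unsigned char c = (unsigned char)str[i];
        for (int j = 0;
             j + 1 < CHUNK && ranges[j] != '\0' && ranges[j + 1] != '\0';
             j += 2) {
            unsigned char lo = (unsigned char)ranges[j];
            unsigned char hi = (unsigned char)ranges[j + 1];
            if (lo <= c && c <= hi) {
                mask |= (uint16_t)(1u << i);
                break;
            }
        }
    }
    return mask;
}
```

For "azAZ" and "I'm here because" the result is 0xFEF5: every bit is set except those of the apostrophe and the two spaces (positions 1, 3 and 8).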
Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:
    operand1 = "We"
    operand2 = "WhenWeWillBeWed!"
    IntRes1  =  0000100000001000
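A scalar C sketch of the substring rule for zero-terminated byte strings. The name `equal_ordered_mask` is hypothetical. The sketch assumes, following the Intel description, that a match which starts inside the 16-byte block but would continue past its end is still reported as 1, so the caller has to verify it against the next block. Bit 0 of the result describes the first position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK = 16 }; /* bytes in one XMM register */

/* Scalar model of Equal ordered: bit i is set when needle matches
   hay starting at position i, as far as the block allows checking. */
uint16_t equal_ordered_mask(const char *needle, const char *hay)
{
    assert(needle != NULL && hay != NULL);
    int nlen = 0, hlen = 0;
    while (nlen < CHUNK && needle[nlen] != '\0') nlen++;
    while (hlen < CHUNK && hay[hlen] != '\0') hlen++;

    uint16_t mask = 0;
    for (int i = 0; i < CHUNK; i++) {
        int match = 1;
        /* bytes beyond the block are not compared (partial match) */
        for (int j = 0; j < nlen && i + j < CHUNK; j++) {
            if (i + j >= hlen || hay[i + j] != needle[j]) {
                match = 0;
                break;
            }
        }
        if (match)
            mask |= (uint16_t)(1u << i);
    }
    return mask;
}
```

For "We" in "WhenWeWillBeWed!" the result is 0x1010 (positions 4 and 12). A block that ends in "W" would additionally get bit 15 set, which is the partial match mentioned above.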
After the aggregation function is computed, IntRes1 can be complemented, and then either expanded into a byte mask or reduced to an index. The result is written to the xmm0 or ECX register. The Intel manual explains these details well, so there is no need to repeat them here.
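As a rough illustration of these post-processing steps, here is a scalar C sketch. The function names are hypothetical, masks are plain integers with bit 0 standing for the first byte, and only plain negative polarity (imm8[5:4] = 01) is modelled; the masked polarity variants are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK = 16 }; /* bytes in one XMM register */

/* Negative polarity (imm8[5:4] = 01): complement every bit of IntRes1. */
uint16_t apply_negative_polarity(uint16_t intres1)
{
    return (uint16_t)~intres1;
}

/* Index form (PcmpIstrI, result in ECX): position of the lowest set
   bit, or of the highest one when most_significant is nonzero
   (imm8[6] = 1). CHUNK is returned when no bit is set. */
int mask_to_index(uint16_t intres2, int most_significant)
{
    for (int k = 0; k < CHUNK; k++) {
        int i = most_significant ? CHUNK - 1 - k : k;
        if (intres2 & (1u << i))
            return i;
    }
    return CHUNK;
}

/* Byte-mask form (PcmpIstrM with imm8[6] = 1, result in xmm0):
   every bit becomes a 0x00 or 0xFF byte. */
void mask_to_bytes(uint16_t intres2, uint8_t out[CHUNK])
{
    assert(out != NULL);
    for (int i = 0; i < CHUNK; i++)
        out[i] = (intres2 & (1u << i)) ? 0xFF : 0x00;
}
```

For instance, the Equal each example corresponds to the integer 0xFF83; complementing it and taking the lowest index gives 2, the position of the first differing character.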
Other features of SSE 4.2
- The strings do not need to be aligned.
- The processor properly handles the end of the string for both zero-terminated and Pascal-style (explicit-length) strings.
- The instructions work on signed or unsigned bytes as well as on 16-bit Unicode characters (words).
- Four aggregation operations can be used to implement a wide range of string-processing functions.
Implementation
Warning: the following code was written before processors with SSE 4.2 support became available, so it has not been tested on real hardware. If you can test it, please post the results here.
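; Immediate byte constants
EQUAL_EACH        = 1000b       ; imm[3:2] = 10
NEGATIVE_POLARITY = 010000b     ; imm[5:4] = 01, complement IntRes1

strcmp:
    ; Using __fastcall convention, ecx = string1, edx = string2
    mov eax, ecx
    sub eax, edx                ; eax = ecx - edx
    sub edx, 16
@@:
    add edx, 16
    MovDqU xmm0, dqword[edx]
    ; look for the first different bytes, hence the negative polarity
    PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH or NEGATIVE_POLARITY
    jc @F                       ; CF = 1: a difference was found
    jnz @B                      ; ZF = 0: no terminator yet, next 16 bytes
    ; the strings are equal
    xor eax, eax
    ret
@@:
    ; subtract the first different bytes (ecx = index set by PcmpIstrI)
    add eax, edx
    movzx eax, byte[eax + ecx]
    movzx edx, byte[edx + ecx]
    sub eax, edx
    ret

strlen:
    ; ecx = string
    mov eax, -16
    mov edx, ecx                ; ecx is overwritten by PcmpIstrI
    pxor xmm0, xmm0             ; xmm0 = empty string
@@:
    add eax, 16
    PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH
    jnz @B                      ; ZF = 0: no terminator in this block
    add eax, ecx                ; block offset + index of the terminator
    ret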
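The branch logic of the strcmp loop can be followed more easily in a scalar C model. This is a sketch under assumptions, not a translation that uses the real instruction: `model_pcmpistri_ne` and `model_strcmp` are hypothetical names, and the model assumes that PcmpIstrI with Equal each and negative polarity sets CF when any byte differs, sets ZF when its second operand contains the terminator, and leaves the index of the first difference in ECX.

```c
#include <assert.h>
#include <stddef.h>

enum { CHUNK = 16 }; /* bytes in one XMM register */

/* What this sketch assumes PcmpIstrI reports for
   EQUAL_EACH or NEGATIVE_POLARITY. */
struct pcmp_result {
    int index; /* ECX: first differing byte, CHUNK if none */
    int cf;    /* set when at least one byte differs */
    int zf;    /* set when the second operand holds the terminator */
};

static struct pcmp_result model_pcmpistri_ne(const char *op1, const char *op2)
{
    struct pcmp_result r = { CHUNK, 0, 0 };
    int end1 = 0, end2 = 0;
    for (int i = 0; i < CHUNK; i++) {
        /* never read a string past its terminator */
        if (!end1 && op1[i] == '\0') end1 = 1;
        if (!end2 && op2[i] == '\0') end2 = 1;
        int differ;
        if (end1 && end2)      differ = 0;
        else if (end1 || end2) differ = 1;
        else                   differ = (op1[i] != op2[i]);
        if (differ && !r.cf) {
            r.cf = 1;
            r.index = i;
        }
    }
    r.zf = end2;
    return r;
}

/* Same control flow as the assembly loop: string2 is the register
   operand, string1 is the memory operand. */
int model_strcmp(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    for (size_t off = 0; ; off += CHUNK) {
        struct pcmp_result r = model_pcmpistri_ne(s2 + off, s1 + off);
        if (r.cf) /* jc @F: subtract the first different bytes */
            return (unsigned char)s1[off + r.index]
                 - (unsigned char)s2[off + r.index];
        if (r.zf) /* jnz @B not taken: terminator reached, strings equal */
            return 0;
    }
}
```

Unlike the assembly, the model never reads past a terminator, so it says nothing about the behaviour of 16-byte loads near the end of a buffer.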
The first processor with SSE 4.2 support is the Intel Core i7.
Additional reading
- comp.arch discussion about the new string-processing instructions.
- Optimizing strlen on processors without SSE4 support.
- A review of Intel Nehalem processor, which will include support for the text processing instructions.
Typo in the Intel manual: in Figure 5-1, "imm8[6:5]" near "Optional boolean negation" should be "imm8[5:4]".
Discussion
It will not work now. You can read the manuals, learn, and wait until you are able to use your knowledge.
We buy this type of processor for general use, not for a specific one. If this trend continues, we may see lots of C library functions in the processor core. That makes the processor pretty useless!