The new instructions
SSE 4.2 introduces four instructions (PcmpEstrI, PcmpEstrM, PcmpIstrI, and PcmpIstrM) that can be used to speed up text processing code (including strcmp, memcmp, strstr, and strspn functions).
Intel had published the description for new instruction formats, but no sample code nor high-level guidelines. This article tries to provide them.
MovDqU xmm0, dqword[str1] PcmpIstrI xmm0, dqword[str2], imm
PcmpIstrI is one of the new string-handling instructions comparing their operands. The first operand is always an SSE register (typically loaded with MovDqU). The second operand can be a memory location. Note the immediate operand, which consists of several bit fields controlling operation modes.
The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).
Equal each (imm[3:2] = 10). This operation compares two strings (think of strcmp or memcmp). The result of comparison is a bit mask (1 if the corresponding bytes are equal, 0 if not equal). For example:
operand1 = "UseFlatAssembler" operand2 = "UsingAnAssembler" IntRes1 = 1100000111111111
Equal any (imm[3:2] = 00). The first operand is a character set, the second is a string (think of strspn or strcspn). The bit mask includes 1 if the character belongs to a set, 0 if not:
operand2 = "You Drive Me Mad", operand1 = "aeiouy" IntRes1 = 0110001010010010
Ranges (imm[3:2] = 01). The first operand consists of ranges, for example, "azAZ" means "all characters from a to z and all characters from A to Z":
operand2 = "I'm here because", operand1 = "azAZ" IntRes1 = 1010111101111111
Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:
operand2 = "WhenWeWillBeWed!", operand1 = "We" IntRes1 = 000010000000100
After computing the aggregation function, IntRes1 can be complemented, expanded into byte mask or shrinked into index. The result is written into xmm0 or ECX registers. Intel manual explains these details well, so there is no need to repeate them here.
Other features of SSE 4.2
- The strings do not need to be aligned.
- The processor properly handles end-of-the-string case for zero-terminated strings and Pascal-style strings.
- You can use the instructions with Unicode characters, signed or unsigned bytes.
- Four aggregation operations can be used to implement a wide range of string-processing functions.
The following strcmp and strlen functions were written and published when the processors with SSE 4.2 support were not available yet. Later they were tested on real hardware and found to be correct.
; compile with FASM ; Immediate byte constants EQUAL_ANY = 0000b RANGES = 0100b EQUAL_EACH = 1000b EQUAL_ORDERED = 1100b NEGATIVE_POLARITY = 010000b BYTE_MASK = 1000000b ; ==== strcmp ==== strcmp_sse42: ; Using __fastcall convention, ecx = string1, edx = string2 mov eax, ecx sub eax, edx ; eax = ecx - edx sub edx, 16 STRCMP_LOOP: add edx, 16 MovDqU xmm0, dqword[edx] ; find the first *different* bytes, hence negative polarity PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY ja STRCMP_LOOP jc STRCMP_DIFF ; the strings are equal xor eax, eax ret STRCMP_DIFF: ; subtract the first different bytes add eax, edx movzx eax, byte[eax + ecx] movzx edx, byte[edx + ecx] sub eax, edx ret ; ==== strlen ==== strlen_sse42: ; ecx = string mov eax, -16 mov edx, ecx pxor xmm0, xmm0 STRLEN_LOOP: add eax, 16 PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH jnz STRLEN_LOOP add eax, ecx ret ; ==== strstr ==== strstr_sse42: ; ecx = haystack, edx = needle push esi push edi MovDqU xmm2, dqword[edx] ; load the first 16 bytes of neddle Pxor xmm3, xmm3 lea eax, [ecx - 16] ; find the first possible match of 16-byte fragment in haystack STRSTR_MAIN_LOOP: add eax, 16 PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED ja STRSTR_MAIN_LOOP jnc STRSTR_NOT_FOUND add eax, ecx ; save the possible match start mov edi, edx mov esi, eax sub edi, esi sub esi, 16 ; compare the strings @@: add esi, 16 MovDqU xmm1, dqword[esi + edi] ; mask out invalid bytes in the haystack PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK MovDqU xmm4, dqword[esi] PAnd xmm4, xmm0 PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY ja @B jnc STRSTR_FOUND ; continue searching from the next byte sub eax, 15 jmp STRSTR_MAIN_LOOP STRSTR_NOT_FOUND: xor eax, eax STRSTR_FOUND: pop edi pop esi ret
The first processor with SSE 4.2 support is Intel Core i7, followed by Intel Core i5.
- Intel SSE4 programming reference.
- comp.arch discussion about the new string-processing instructions.
- Optimizing strlen on processors without SSE4 support.
- A review of Intel Nehalem processor, which includes support for the text processing instructions.
Typo in Intel manual: on figure 5-1, "imm8[6:5]" near Optional boolean negation should be "imm8[5:4]".
Ten recent comments are shown below. Show all comments
Note that the inner loop in the above code gets slightly bigger in case GCC chooses e.g. %rbp or %r12 for %4. This can be avoided by explicitly choosing a register, e.g.
"d"(ct) for RDX - at the expense of less freedom in register allocation
Thank you very much for sharing this! Fitting the loop into 16 bytes is a great idea; you have done a lot of work on the lowest level to achieve this.
You are right that equality is tested more often than the alphabetic order, but the comparison to zero is often missed:
In my programs, I use non-standard strcmpeq/memcmpeq functions, which return true if the strings are equal and false if not. It's certainly a matter of taste, and you might prefer the standard behavior (returning zero for equal strings).
It may be better to implement these functions using intrinsics, that way the compiler is better able to choose the right registers:
I did not manage to get gcc to emit instructions which use a register instead of a direct constant - it could be that the compiler believes it performs worse...
(NB am using "asm goto" feature in the above code, it interacts poorly with optimization sometimes because gcc does not know it depends on the flags status)
Comment on the SSE4.2 strlen() implementation: it actually performs much worse than the following SSE2 implementation (inspired by Agner Fog's implementation):
Measurements using Agner Fog's testPMC routines on a Core i7, calling 100000 * strlen() for 4 test strings:
C++ inlined version of the above code:
So even though the SSE4 strlen() code above contains far less instructions, these translate into many uops, resulting in more CPU cycles
Correction: second measurement above is for the SSE4 version of course
Sorry for the late answer. My results on Core i5:
So you version is the winner for long strings. You may be interested to look at a strchr implementation using SSE2. In comparison with it, you carefully avoided the branch for the first misaligned chunk. Nice job!
Peter, related but a bit of the topic, have you ever went about optimizing pattern matching with a mask?
Basically the common byte search with wildcards that typically looks like this:
There the "??" represent the variable bytes that you want to skip.
There are many ways of representing this kind of data that could be baked beforehand.
One could have an array of the same size bytes using flags '01' for keep/compare and '00' to skip, or sort of run length encoding, etc.; and of course not dealing with the pattern as a string (no need for that overhead).
I have this idea that one could use SSE instructions with some sort of mask, using bit operation, one might be able to process many pattern bytes at the time.
You already described it. Instead of 01 for keep/compare, use 0xFF. So in the mask, you will have 0xFF for each byte you want to compare and 0 for each byte you want to skip. Then run PAND instruction between the loaded bytes and the mask. After PAND, all skipped bytes will turn to zero. In the pattern, you should have zeros for skipped bytes, so that they will be equal.
Just recently came across this thread.
There is one issue with the pcmpistri (and most SSE-based) algorithms- they *always* fetch 16 bytes even if the zero byte is among them. In the (admittedly very) rare case where some of those bytes beyond the zero-terminating byte cross an MMU page boundary, it is possible to get a general protection fault.
Granted, the possibility is quite low. However, you would not want to use such an algorithm for a mission critical application (say, nuclear reactor or life-critical stuff) unless you can guarantee that there are at least 15 bytes of real memory after the zero byte.
Note that as memory is only protected in page sized units, as long as you ensure proper alignment before you start using SSE/AVX instructions you will be safe. But yes, any strcmp/strlen style function that uses SSE/AVX/etc. without a buffer size parameter, must first read in smaller units until it gets to a 16/32 byte aligned address, after which it can start using SSE/AVX safely without having to worry about faulting even if it reads past the end of the string.