The new instructions
SSE 4.2 introduces four instructions (PcmpEstrI, PcmpEstrM, PcmpIstrI, and PcmpIstrM) that can be used to speed up text processing code (including strcmp, memcmp, strstr, and strspn functions).
Intel had published the description for new instruction formats, but no sample code nor high-level guidelines. This article tries to provide them.
MovDqU xmm0, dqword[str1]
PcmpIstrI xmm0, dqword[str2], imm
PcmpIstrI is one of the new string-handling instructions comparing their operands. The first operand is always an SSE register (typically loaded with MovDqU). The second operand can be a memory location. Note the immediate operand, which consists of several bit fields controlling operation modes.
Aggregation operations
The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).
Equal each (imm[3:2] = 10). This operation compares two strings (think of strcmp or memcmp). The result of comparison is a bit mask (1 if the corresponding bytes are equal, 0 if not equal). For example:
operand1 = "UseFlatAssembler" operand2 = "UsingAnAssembler" IntRes1 = 1100000111111111
Equal any (imm[3:2] = 00). The first operand is a character set, the second is a string (think of strspn or strcspn). The bit mask includes 1 if the character belongs to a set, 0 if not:
operand2 = "You Drive Me Mad", operand1 = "aeiouy" IntRes1 = 0110001010010010
Ranges (imm[3:2] = 01). The first operand consists of ranges, for example, "azAZ" means "all characters from a to z and all characters from A to Z":
operand2 = "I'm here because", operand1 = "azAZ" IntRes1 = 1010111101111111
Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:
operand2 = "WhenWeWillBeWed!", operand1 = "We" IntRes1 = 000010000000100
After computing the aggregation function, IntRes1 can be complemented, expanded into byte mask or shrinked into index. The result is written into xmm0 or ECX registers. Intel manual explains these details well, so there is no need to repeate them here.
Other features of SSE 4.2
- The strings do not need to be aligned.
- The processor properly handles end-of-the-string case for zero-terminated strings and Pascal-style strings.
- You can use the instructions with Unicode characters, signed or unsigned bytes.
- Four aggregation operations can be used to implement a wide range of string-processing functions.
Implementation
The following strcmp and strlen functions were written and published when the processors with SSE 4.2 support were not available yet. Later they were tested on real hardware and found to be correct.
; compile with FASM
; Immediate byte constants
EQUAL_ANY = 0000b
RANGES = 0100b
EQUAL_EACH = 1000b
EQUAL_ORDERED = 1100b
NEGATIVE_POLARITY = 010000b
BYTE_MASK = 1000000b
; ==== strcmp ====
strcmp_sse42:
; Using __fastcall convention, ecx = string1, edx = string2
mov eax, ecx
sub eax, edx ; eax = ecx - edx
sub edx, 16
STRCMP_LOOP:
add edx, 16
MovDqU xmm0, dqword[edx]
; find the first *different* bytes, hence negative polarity
PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
ja STRCMP_LOOP
jc STRCMP_DIFF
; the strings are equal
xor eax, eax
ret
STRCMP_DIFF:
; subtract the first different bytes
add eax, edx
movzx eax, byte[eax + ecx]
movzx edx, byte[edx + ecx]
sub eax, edx
ret
; ==== strlen ====
strlen_sse42:
; ecx = string
mov eax, -16
mov edx, ecx
pxor xmm0, xmm0
STRLEN_LOOP:
add eax, 16
PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH
jnz STRLEN_LOOP
add eax, ecx
ret
; ==== strstr ====
strstr_sse42:
; ecx = haystack, edx = needle
push esi
push edi
MovDqU xmm2, dqword[edx] ; load the first 16 bytes of neddle
Pxor xmm3, xmm3
lea eax, [ecx - 16]
; find the first possible match of 16-byte fragment in haystack
STRSTR_MAIN_LOOP:
add eax, 16
PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED
ja STRSTR_MAIN_LOOP
jnc STRSTR_NOT_FOUND
add eax, ecx ; save the possible match start
mov edi, edx
mov esi, eax
sub edi, esi
sub esi, 16
; compare the strings
@@:
add esi, 16
MovDqU xmm1, dqword[esi + edi]
; mask out invalid bytes in the haystack
PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK
MovDqU xmm4, dqword[esi]
PAnd xmm4, xmm0
PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
ja @B
jnc STRSTR_FOUND
; continue searching from the next byte
sub eax, 15
jmp STRSTR_MAIN_LOOP
STRSTR_NOT_FOUND:
xor eax, eax
STRSTR_FOUND:
pop edi
pop esi
ret
The first processor with SSE 4.2 support is Intel Core i7, followed by Intel Core i5.
Additional reading
- Intel SSE4 programming reference.
- comp.arch discussion about the new string-processing instructions.
- Optimizing strlen on processors without SSE4 support.
- A review of Intel Nehalem processor, which includes support for the text processing instructions.
Typo in Intel manual: on figure 5-1, "imm8[6:5]" near Optional boolean negation should be "imm8[5:4]".

10 comments
It will not work now. You can read manuals, learn, and wait until you will be able to use your knowledge.
We buy these type processor for general usage, not for specific. If this trend continues we may see lots c library functions in the processor core. That makes processor pretty useless !
Hi Peter,
I want to share my last attempt to boost my old(from this summer) strstr-like C function at:
http://www.sanmayce.com/Railgun/index.html
Railgun_Quadruplet follows:
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [ // Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char. char * Railgun_Quadruplet (char * pbTarget, char * pbPattern, unsigned long cbTarget, unsigned long cbPattern) { char * pbTargetMax = pbTarget + cbTarget; register unsigned long ulHashPattern; unsigned long ulHashTarget; unsigned long count; unsigned long countSTATIC; // unsigned long countRemainder; /* const unsigned char SINGLET = *(char *)(pbPattern); const unsigned long Quadruplet2nd = SINGLET<<8; const unsigned long Quadruplet3rd = SINGLET<<16; const unsigned long Quadruplet4th = SINGLET<<24; */ unsigned char SINGLET; unsigned long Quadruplet2nd; unsigned long Quadruplet3rd; unsigned long Quadruplet4th; unsigned long AdvanceHopperGrass; long i; //BMH needed int a, j, bm_bc[ASIZE]; //BMH needed unsigned char ch; //BMH needed // unsigned char lastch, firstch; //BMH needed if (cbPattern > cbTarget) return(NULL); // Doesn't work when cbPattern = 1 // The next IF-fragment works very well with cbPattern>1, OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality) SINCE either cbPattern=2 or cbPattern=3! if ( cbPattern<4) { // This IF makes me unhappy: it slows down from 390KB/clock to 367KB/clock for 'fast' pattern. This fragment(for 2..3 pattern lengths) is needed because I need a function different than strchr but sticking to strstr i.e. lengths above 1 are to be handled. pbTarget = pbTarget+cbPattern; ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1)); countSTATIC = cbPattern-2; for ( ;; ) { // The line below gives for 'cbPattern'>=1: // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543 // Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock /* if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern, (unsigned int)cbPattern) ) return((long)(pbTarget-cbPattern)); */ // The fragment below gives for 'cbPattern'>=2: // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546 // Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) { count = countSTATIC; while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) { count--; } if ( count == 0) return((pbTarget-cbPattern)); } // The fragment below gives for 'cbPattern'>=2: // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554 // Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock /* if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) { count = countSTATIC>>2; countRemainder = countSTATIC % 4; while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) == *(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) { count--; } //if (count == 0) { // Disastrous degradation only from this line(317KB/clock when 1+2x4+2+1 bytes pattern: 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes pattern: 'underdog'), otherwise 368KB/clock. while ( countRemainder && *(char *)(pbPattern+1+(countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-countRemainder)) ) { countRemainder--; } //if ( countRemainder == 0) return((long)(pbTarget-cbPattern)); if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern)); //} } */ pbTarget++; if (pbTarget > pbTargetMax) return(NULL); } } else { //if ( cbPattern<4) if (cbTarget<961) // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'. { pbTarget = pbTarget+cbPattern; ulHashPattern = *(unsigned long *)(pbPattern); countSTATIC = cbPattern-1; //SINGLET = *(char *)(pbPattern); SINGLET = ulHashPattern & 0xFF; Quadruplet2nd = SINGLET<<8; Quadruplet3rd = SINGLET<<16; Quadruplet4th = SINGLET<<24; for ( ;; ) { AdvanceHopperGrass = 0; ulHashTarget = *(unsigned long *)(pbTarget-cbPattern); if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority. count = countSTATIC; while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) { if ( countSTATIC==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) AdvanceHopperGrass++; count--; } if ( count == 0) return((pbTarget-cbPattern)); } else { // The goal here: to avoid memory accesses by stressing the registers. if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) { AdvanceHopperGrass++; if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) { AdvanceHopperGrass++; if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;; } } } AdvanceHopperGrass++; pbTarget = pbTarget + AdvanceHopperGrass; if (pbTarget > pbTargetMax) return(NULL); } } else { //if (cbTarget<961) countSTATIC = cbPattern-2; /* Preprocessing */ for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern; for (j=0; j < cbPattern-1; j++) bm_bc[pbPattern[j]]=cbPattern-j-1; /* Searching */ //lastch=pbPattern[cbPattern-1]; //firstch=pbPattern[0]; i=0; while (i <= cbTarget-cbPattern) { ch=pbTarget[i+cbPattern-1]; //if (ch ==lastch) //if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i); //if (ch == lastch && pbTarget[i] == firstch && memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i); // Kaze: The idea(to prevent execution of slower 'memcmp') is borrowed from Karp-Rabin i.e. to perform a slower check only when the target "looks like". if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) { count = countSTATIC; while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) { count--; } if ( count == 0) return(pbTarget+i); } i+=bm_bc[ch]; } return(NULL); } //if (cbTarget<961) } //if ( cbPattern<4) } // ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]I would be glad if other people can contribute in refining Railgun_Quadruplet, also to explore speed performance on different CPUs especially i3+ and AMD Opteron.
Regards
The Intel documentation for these instructions is utterly indecipherable. Your explanation is very clear and easy to understand. Thanks!