Using new Intel Core i7 instructions to speed up string manipulation.

Created by Peter Kankowski
Last changed
Contributors: Dap and Adam Messinger
Filed under Low-level code optimization


Implementing strcmp, strlen, and strstr using SSE 4.2 instructions

The new instructions

SSE 4.2 introduces four instructions (PcmpEstrI, PcmpEstrM, PcmpIstrI, and PcmpIstrM) that can be used to speed up text processing code (including strcmp, memcmp, strstr, and strspn functions).

Intel has published a description of the new instruction formats, but no sample code or high-level guidelines. This article tries to provide them.

MovDqU    xmm0, dqword[str1]
PcmpIstrI xmm0, dqword[str2], imm

PcmpIstrI is one of the new string-handling instructions; all four of them compare their operands. The first operand is always an SSE register (typically loaded with MovDqU). The second operand can be a memory location. Note the immediate operand, which consists of several bit fields controlling the operation modes.

Aggregation operations

The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).

Equal each (imm[3:2] = 10). This operation compares two strings (think of strcmp or memcmp). The result of comparison is a bit mask (1 if the corresponding bytes are equal, 0 if not equal). For example:

operand1 = "UseFlatAssembler"
operand2 = "UsingAnAssembler"
IntRes1  =  1100000111111111

Equal any (imm[3:2] = 00). The first operand is a character set, the second is a string (think of strspn or strcspn). The bit mask includes 1 if the character belongs to a set, 0 if not:

operand2 = "You Drive Me Mad", operand1 = "aeiouy"
IntRes1  =  0110001010010010

Ranges (imm[3:2] = 01). The first operand consists of ranges, for example, "azAZ" means "all characters from a to z and all characters from A to Z":

operand2 = "I'm here because", operand1 = "azAZ"
IntRes1  =  1010111101111111

Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:

operand2 = "WhenWeWillBeWed!", operand1 = "We"
IntRes1  =  0000100000001000

After computing the aggregation function, IntRes1 can be complemented, expanded into a byte mask, or shrunk into an index. The result is written to the xmm0 or ECX register. Intel's manual explains these details well, so there is no need to repeat them here.

Other features of SSE 4.2

  • The strings do not need to be aligned.
  • The processor properly handles the end-of-string case for both zero-terminated and Pascal-style (explicit-length) strings.
  • You can use the instructions with Unicode (16-bit) characters as well as with signed or unsigned bytes.
  • Four aggregation operations can be used to implement a wide range of string-processing functions.


The following strcmp and strlen functions were written and published before processors with SSE 4.2 support were available. They were later tested on real hardware and found to be correct.

; compile with FASM
; Immediate byte constants
EQUAL_ANY	    = 0000b
RANGES		    = 0100b
EQUAL_EACH	    = 1000b
EQUAL_ORDERED	    = 1100b
NEGATIVE_POLARITY = 010000b
BYTE_MASK	 = 1000000b

; ==== strcmp ====
strcmp_sse42:
  ; Using __fastcall convention, ecx = string1, edx = string2
  mov eax, ecx
  sub eax, edx ; eax = ecx - edx
  sub edx, 16
@@:
    add edx, 16
    MovDqU    xmm0, dqword[edx]
    ; find the first *different* bytes, hence negative polarity
    PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
    ja @B ; neither a difference nor a terminator found: next 16 bytes
  jc @F   ; a difference was found at index ecx

  ; the strings are equal
  xor eax, eax
  ret
@@:
  ; subtract the first different bytes
  add eax, edx
  movzx eax, byte[eax + ecx]
  movzx edx, byte[edx + ecx]
  sub eax, edx
  ret

; ==== strlen ====
strlen_sse42:
  ; ecx = string
  mov eax, -16
  mov edx, ecx
  pxor xmm0, xmm0
@@:
    add eax, 16
    PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH
    jnz @B     ; loop while no zero byte is found
  add eax, ecx ; add the index of the zero byte within the last 16 bytes
  ret

; ==== strstr ====
strstr_sse42:
  ; ecx = haystack, edx = needle

  push esi
  push edi
  MovDqU xmm2, dqword[edx] ; load the first 16 bytes of needle
  Pxor xmm3, xmm3
  lea eax, [ecx - 16]

  ; find the first possible match of 16-byte fragment in haystack
STRSTR_MAIN_LOOP:
    add eax, 16
    PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED
    ja STRSTR_MAIN_LOOP
  jnc STRSTR_NOT_FOUND

  add eax, ecx ; save the possible match start
  mov edi, edx
  mov esi, eax
  sub edi, esi
  sub esi, 16

  ; compare the strings
@@:
    add esi, 16
    MovDqU    xmm1, dqword[esi + edi]
    ; build a byte mask of the valid needle bytes in xmm0
    PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK
    ; mask out invalid bytes in the haystack
    MovDqU xmm4, dqword[esi]
    PAnd xmm4, xmm0
    PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
    ja @B
  jnc STRSTR_FOUND

  ; continue searching from the next byte
  sub eax, 15
  jmp STRSTR_MAIN_LOOP

STRSTR_NOT_FOUND:
  xor eax, eax
STRSTR_FOUND:
  pop edi
  pop esi
  ret

The first processor with SSE 4.2 support was the Intel Core i7, followed by the Intel Core i5.

Additional reading

Typo in Intel manual: on figure 5-1, "imm8[6:5]" near Optional boolean negation should be "imm8[5:4]".

Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP. He can be reached at kankowski@gmail.com.




How would I go about making these inline asm functions in MSVC?

Peter Kankowski,

As I already said, processors with SSE 4.2 have not been officially released yet. Compilers and assemblers are only starting to implement SSE4 support.

It will not work now. You can read the manuals, learn, and wait until you are able to use your knowledge.


lol, that's pretty useless.


The link to Intel's documentation is dead. I think it should be this one: http://softwarecommunity.intel.com/userfiles/en-us/d9156103.pdf

Peter Kankowski,

Thank you, Dap, I updated the link. Intel changes the location of the manuals so often...


Intel provides C library functions with the hardware, which is not a good thing.

We buy these types of processors for general use, not for specific tasks. If this trend continues, we may see lots of C library functions in the processor core. That makes the processor pretty useless!

Adam Messinger,

In your equal any example, will the first bit really be 1? The Y in "You Drive Me Mad" is uppercase whereas the y in "aeiouy" is lowercase. I don't see anything in the documentation about case-insensitive matching.

Peter Kankowski,

Thanks, it was an error. Mea culpa.

Georgi 'Sanmayce',

Hi Peter,

I want to share my latest attempt to boost my old (from this summer) strstr-like C function at:


Railgun_Quadruplet follows (braces and a few lost lines restored; the two alternative hash-check fragments are kept as comments, matching the commented-out countRemainder declaration):

// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm [
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
#define ASIZE 256 /* table size for the BMH bad-character skip */
char * Railgun_Quadruplet (char * pbTarget,
     char * pbPattern,
     unsigned long cbTarget,
     unsigned long cbPattern)
{
    char * pbTargetMax = pbTarget + cbTarget;
    register unsigned long  ulHashPattern;
    unsigned long ulHashTarget;
    unsigned long count;
    unsigned long countSTATIC;
//  unsigned long countRemainder;
    unsigned char SINGLET;
    unsigned long Quadruplet2nd;
    unsigned long Quadruplet3rd;
    unsigned long Quadruplet4th;
    unsigned long  AdvanceHopperGrass;
    long i; //BMH needed
    int a, j, bm_bc[ASIZE]; //BMH needed
    unsigned char ch; //BMH needed
//    unsigned char lastch, firstch; //BMH needed

    if (cbPattern > cbTarget) return(NULL);

// Doesn't work when cbPattern = 1
// The next IF-fragment works very well with cbPattern>1, OBVIOUSLY IT MUST BE UNROLLED(but crippled with less functionality) SINCE either cbPattern=2 or cbPattern=3!
if ( cbPattern<4) { // This IF makes me unhappy: it slows down from 390KB/clock to 367KB/clock for 'fast' pattern. This fragment(for 2..3 pattern lengths) is needed because I need a function different than strchr but sticking to strstr i.e. lengths above 1 are to be handled.
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = ( (*(char *)(pbPattern))<<8 ) + *(pbPattern+(cbPattern-1));
        countSTATIC = cbPattern-2;

    for ( ;; ) {
        // The alternative line below gives for 'cbPattern'>=1:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/543
        // Karp_Rabin_Kaze_4_OCTETS performance: 372KB/clock
        //if ( (ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1)) && !memcmp(pbPattern, pbTarget-cbPattern, (unsigned int)cbPattern) )
        //     return((pbTarget-cbPattern));

        // The fragment below gives for 'cbPattern'>=2:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/546
        // Karp_Rabin_Kaze_4_OCTETS performance: 370KB/clock
        if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
         count = countSTATIC;
         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        }

        // An alternative quadword-compare fragment, for 'cbPattern'>=2:
        // Karp_Rabin_Kaze_4_OCTETS_hits/Karp_Rabin_Kaze_4_OCTETS_clocks: 4/554
        // Karp_Rabin_Kaze_4_OCTETS performance: 364KB/clock
        //if ( ulHashPattern == ( (*(char *)(pbTarget-cbPattern))<<8 ) + *(pbTarget-1) ) {
        // count = countSTATIC>>2;
        // countRemainder = countSTATIC % 4;
        // while ( count && *(unsigned long *)(pbPattern+1+((count-1)<<2)) == *(unsigned long *)(pbTarget-cbPattern+1+((count-1)<<2)) ) count--;
        // //if (count == 0) {  // Disastrous degradation only from this line(317KB/clock when 1+2x4+2+1 bytes pattern: 'skillessness'; 312KB/clock when 1+1x4+2+1 bytes pattern: 'underdog'), otherwise 368KB/clock.
        // while ( countRemainder && *(char *)(pbPattern+1+(countSTATIC-countRemainder)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-countRemainder)) ) countRemainder--;
        // //if ( countRemainder == 0) return((long)(pbTarget-cbPattern));
        // if ( count+countRemainder == 0) return((long)(pbTarget-cbPattern));
        //}

        pbTarget++;
        if (pbTarget > pbTargetMax) return(NULL);
    }
} else { //if ( cbPattern<4)
if (cbTarget<961) { // This value is arbitrary(don't know how exactly), it ensures(at least must) better performance than 'Boyer_Moore_Horspool'.
        pbTarget = pbTarget+cbPattern;
        ulHashPattern = *(unsigned long *)(pbPattern);
        countSTATIC = cbPattern-1;

    //SINGLET = *(char *)(pbPattern);
    SINGLET = ulHashPattern & 0xFF;
    Quadruplet2nd = SINGLET<<8;
    Quadruplet3rd = SINGLET<<16;
    Quadruplet4th = SINGLET<<24;

    for ( ;; ) {
	AdvanceHopperGrass = 0;
	ulHashTarget = *(unsigned long *)(pbTarget-cbPattern);

        if ( ulHashPattern == ulHashTarget ) { // Three unnecessary comparisons here, but 'AdvanceHopperGrass' must be calculated - it has a higher priority.
         count = countSTATIC;
         while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) {
	       if ( countSTATIC==AdvanceHopperGrass+count && SINGLET != *(char *)(pbTarget-cbPattern+1+(countSTATIC-count)) ) AdvanceHopperGrass++;
               count--;
         }
         if ( count == 0) return((pbTarget-cbPattern));
        } else { // The goal here: to avoid memory accesses by stressing the registers.
            if ( Quadruplet2nd != (ulHashTarget & 0x0000FF00) ) {
                 AdvanceHopperGrass++;
                 if ( Quadruplet3rd != (ulHashTarget & 0x00FF0000) ) {
                      AdvanceHopperGrass++;
                      if ( Quadruplet4th != (ulHashTarget & 0xFF000000) ) AdvanceHopperGrass++;
                 }
            }
        }

	AdvanceHopperGrass++;
	pbTarget = pbTarget + AdvanceHopperGrass;
        if (pbTarget > pbTargetMax) return(NULL);
    }
} else { //if (cbTarget<961)
        countSTATIC = cbPattern-2;

    /* Preprocessing */
    for (a=0; a < ASIZE; a++) bm_bc[a]=cbPattern;
    for (j=0; j < cbPattern-1; j++) bm_bc[(unsigned char)pbPattern[j]]=cbPattern-j-1;

    /* Searching */
    i = 0;
    while (i <= cbTarget-cbPattern) {
       ch = pbTarget[i+cbPattern-1];
       //if (ch ==lastch)
          //if (memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) OUTPUT(i);
          //if (ch == lastch && pbTarget[i] == firstch && memcmp(&pbTarget[i],pbPattern,cbPattern-1) == 0) return(i);  // Kaze: The idea(to prevent execution of slower 'memcmp') is borrowed from Karp-Rabin i.e. to perform a slower check only when the target "looks like".
          if (ch == pbPattern[cbPattern-1] && pbTarget[i] == pbPattern[0]) {
             count = countSTATIC;
             while ( count && *(char *)(pbPattern+1+(countSTATIC-count)) == *(char *)(&pbTarget[i]+1+(countSTATIC-count)) ) {
                   count--;
             }
             if ( count == 0) return(pbTarget+i);
          }
       i += bm_bc[ch];
    }
    return(NULL);
} //if (cbTarget<961)
} //if ( cbPattern<4)
}
// ### Mix(2in1) of Karp-Rabin & Boyer-Moore-Horspool algorithm ]

I would be glad if other people can contribute to refining Railgun_Quadruplet, and also explore its speed on different CPUs, especially i3+ and AMD Opteron.



The Intel documentation for these instructions is utterly indecipherable. Your explanation is very clear and easy to understand. Thanks!

Jeroen van Bemmel,

I took your code and produced an x86-64 version for GCC C/C++. By loading the constant 16 into a register, it is possible to fit the inner loop in 16 bytes of opcodes (latest gcc -O5) - a good match for the i7 instruction prefetcher. Moreover, I found that PCMPISTRI clears the upper bits of RCX, so it is not necessary to clear RCX upfront. The setup code also fits in 16 bytes.

A more optimal version (non-standard) could use:

"jc 2f      \n" // rax is non-zero unless cs==ct, in which case c=false
"xor %0, %0 \n" // Executed "for free" on i7 (during reg renaming)

at the end, to return a bool result (equal / not equal) rather than the difference between bytes. Many calls to strcmp() only test for equality anyway, and this short branch with xor costs (almost) nothing on i7.

static inline int __strcmp(const char * cs, const char * ct)
{
        // see http://www.strchr.com/strcmp_and_strlen_using_sse_4.2
        __m128i anyXmm;
        long __res, rcx, anyR = 16;
        __asm__ __volatile__(
                "sub            %5, %4                  \n"
                ".align 16                              \n"     // 0 filler bytes :)
                "1:\n"  // loop is 16 opcode bytes :)
                "add            %5, %4                  \n"
                "movdqu         (%4), %2                \n"     // Use any XMM, using register constraint "x"
                "pcmpistri      $0x18, (%4,%0), %2      \n"     // EQUAL_EACH(0x08)+NEGATIVE_POLARITY(0x10)
                                                                // clears upper bits of RCX
                "ja 1b                                  \n"
                "jc 2f                                  \n"
                "xor %0, %0                             \n"
                "jmp 3f                                 \n"     // XXX Extra jump could be avoided in pure asm
                "2:\n"
                "add %4, %0                             \n"
                "movzxb (%0,%1), %0                     \n"     // Note: uses full RCX(%1)
                "movzxb (%4,%1), %4                     \n"
                "sub %4, %0                             \n"
                "3:\n"
        : "=r"(__res), "+c"(rcx), "=x"(anyXmm) : "0"(cs-ct), "r"(ct), "r"(anyR) );

        return (int) __res;
}
Jeroen van Bemmel,

Note that the inner loop in the above code gets slightly bigger if GCC chooses e.g. %rbp or %r12 for %4. This can be avoided by explicitly choosing a register, e.g. "d"(ct) for RDX, at the expense of less freedom in register allocation.

Peter Kankowski,

Thank you very much for sharing this! Fitting the loop into 16 bytes is a great idea; you have done a lot of work on the lowest level to achieve this.

You are right that equality is tested more often than the alphabetic order, but the comparison to zero is often missed:

if (strcmp(str, "text"))

instead of

if (strcmp(str, "text") == 0)

In my programs, I use non-standard strcmpeq/memcmpeq functions, which return true if the strings are equal and false if not. It's certainly a matter of taste, and you might prefer the standard behavior (returning zero for equal strings).

Jeroen van Bemmel,

It may be better to implement these functions using intrinsics, so that the compiler is better able to choose the right registers:

static inline int strcmp_sse42(const char *cs, const char *ct)
{
  long diff = cs-ct;
  long nextbytes = 16; // asm ("rdx") does not work
  ct -= 16;

// Force GCC to use a register for nextbytes? Makes the code much worse! (adds LEA)
// __asm__ __volatile__( "mov $16, %0" : "=r"(nextbytes) );

loop:
	// could align it ASMVOLATILE( ".align 16\n" : : : "memory" );
	__m128i ct16chars = _mm_loadu_si128( (const __m128i *) (ct += nextbytes) );
	int offset = _mm_cmpistri( ct16chars, * (const __m128i *) (ct+diff),
                       _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY );
	__asm__ __volatile__ goto( "ja %l[loop] \n jc %l[not_equal]" : : : "memory" : loop, not_equal );
	return 0;

not_equal:
	return ct[diff+offset] - ct[offset];
}

I did not manage to get gcc to emit instructions which use a register instead of a direct constant - it could be that the compiler believes it performs worse...

(NB am using "asm goto" feature in the above code, it interacts poorly with optimization sometimes because gcc does not know it depends on the flags status)

Jeroen van Bemmel,

Comment on the SSE4.2 strlen() implementation: it actually performs much worse than the following SSE2 implementation (inspired by Agner Fog's implementation):

typedef unsigned char u8;

size_t strlen( const char* s ) {
	__m128i zero = _mm_set1_epi8( 0 );
	__m128i *s_aligned = (__m128i*) (((long)s) & -0x10L);
	u8 misbits = ((long)s) & 0xf;
	__m128i s16cs = _mm_load_si128( s_aligned );
	__m128i bytemask = _mm_cmpeq_epi8( s16cs, zero );
	int bitmask = _mm_movemask_epi8( bytemask );
	bitmask = (bitmask>>misbits)<<misbits;

	// Alternative: use TEST instead of BSF, then BSF at end (only). Much better on older CPUs
	// TEST has latency 1, while BSF has 3!
	while (bitmask==0) {
		s16cs = _mm_load_si128( ++s_aligned );
		bytemask = _mm_cmpeq_epi8( s16cs, zero );
		bitmask = _mm_movemask_epi8( bytemask );
	}

	// bsf only takes 1 clock cycle on modern cpus
	return ((( const char* ) s_aligned ) - s) + __builtin_ctz(bitmask);
}

Measurements using Agner Fog's testPMC routines on a Core i7, calling 100000 * strlen() for 4 test strings:

C++ inlined version of the above code:

Processor 0

     Clock    Core cyc   Instruct       Uops  BrMispred 
   6977183    4862520   11500107   12627774         29 
   6558494    4810778   11500010   12600524         22 
   6640618    4852955   11500011   12600727          7 
   6559492    4810098   11500010   12600491          8 
   6566901    4805810   11500010   12600462          9 
   6535459    4796366   11500010   12600415          3 
   6542663    4803975   11500010   12600415          4 
   6555762    4797329   11500011   12600935         13

SSE2 version:

Processor 0

     Clock    Core cyc   Instruct       Uops  BrMispred 
  42095645   33382200    2500115   48931751          1 
  31560226   33403649    2500014   48910384          2 
  30967280   33373350    2500016   48907965          3 
  30473645   33376225    2500015   48907719          2 
  30468486   33378743    2500016   48907860          1 
  30474334   33377705    2500016   48907859          1 
  30473657   33378184    2500015   48907638          1 
  30474393   33376966    2500015   48908046          1

So even though the SSE4 strlen() code above contains far fewer instructions, these translate into many uops, resulting in more CPU cycles.

Jeroen van Bemmel,

Correction: second measurement above is for the SSE4 version of course

Peter Kankowski,

Sorry for the late answer. My results on Core i5:

Short strings (from 1 to 10 characters, from Gettysburg Address)
Microsoft's strlen    990
my_strlen             640
Agner Fog's strlen   1132
SSE2 (your version)   886
SSE4.2                613

A long string ("Jabberwocky")
Microsoft's strlen    919
my_strlen            1156
Agner Fog's strlen   2200
SSE2 (your version)   533
SSE4.2                610

So your version is the winner for long strings. You may be interested to look at a strchr implementation using SSE2. Compared with it, you carefully avoided the branch for the first misaligned chunk. Nice job!


Peter, related but a bit off topic: have you ever gone about optimizing pattern matching with a mask?

Basically the common byte search with wildcards that typically looks like this:

"55 8B EC 6A FF 68 ?? ?? ?? ?? 64 A1 00 00 00 00 50 64 89 25 00 00 00 00 81"

There the "??" represent the variable bytes that you want to skip.

There are many ways of representing this kind of data that could be baked beforehand.

One could have a second array of the same size with flag bytes ('01' for keep/compare, '00' for skip), or a sort of run-length encoding, etc.; and of course not dealing with the pattern as a string (no need for that overhead).

I have this idea that one could use SSE instructions with some sort of mask, using bit operation, one might be able to process many pattern bytes at the time.

Any ideas?

Peter Kankowski,

You already described it. Instead of 01 for keep/compare, use 0xFF. So in the mask, you will have 0xFF for each byte you want to compare and 0 for each byte you want to skip. Then run PAND instruction between the loaded bytes and the mask. After PAND, all skipped bytes will turn to zero. In the pattern, you should have zeros for skipped bytes, so that they will be equal.

Renat Saifutdinov,

Any performance test of the strlen_sse42 function does not make sense, because the function has a bug that causes a crash when the null character of the source string is located near a protected memory page boundary (the function always reads a full 16 bytes, even past the terminator).

An aligned read, with the false bits masked out, is necessary.

Test for Windows platform:

#include <windows.h>
#include <string.h>

int strlen_sse42(const char *str); // the SSE 4.2 function from the article

int length;

int main()
{
  enum { N = 0x1000 };

  // allocate two pages
  char *str = (char*)VirtualAlloc(0, N + N, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
  // protect next page
  DWORD old;
  BOOL res = VirtualProtect(str + N, N, PAGE_NOACCESS, &old);

  size_t i;

  // fill first page
  for(i = 0; i != N; ++i) {
    str[i] = 'a' + (i % 20);
    if((i % 13) == 0) str[i] = 0;
  }
  str[N - 1] = 0; // terminator at end of first page

  for(size_t i = 0; i != N; ++i) {
    length = strlen_sse42(str + i); // <<< crash!
    if(strlen(str + i) != length) *(char*)0 = 0;
  }
  return 0;
}
Peter Kankowski,

Thank you, I will change it to use aligned reads.


I found an evil hack for this while I was trying to figure out the flags. After this section:


STRCMP_LOOP:
      add edx, 16
      MovDqU    xmm0, dqword[edx]
      ; find the first *different* bytes, hence negative polarity
      PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
      ja STRCMP_LOOP

ecx is only 16 here if the strings are equal (and the terminator has been found), so the final sub just needs to yield 0. In particular, if we turn 16 into 0:

and ecx, 15

we make this yield 0:

      movzx eax, byte[eax + ecx]
      movzx edx, byte[edx + ecx]

sub eax, edx

since the first bytes of the comparison must be equal in this case. For other cases ecx is unchanged.

I don't know if it's better, but it's one way to handle the exit conditions.

Beautiful!

Thanks, Peter.

The 32-bit implementation was exactly what I was looking for.

It works better than the native VC12 one.
