Abstract:
Using new Intel Core i7 instructions to speed up string manipulation.

Created 2 years ago by Peter Kankowski
Last changed 6 months ago
Contributors: Dap and Adam Messinger
Filed under Low-level code optimization

Reddit Digg Delicious Buzz Facebook Twitter

Implementing strcmp, strlen, and strstr using SSE 4.2 instructions

The new instructions

SSE 4.2 introduces four instructions (PcmpEstrI, PcmpEstrM, PcmpIstrI, and PcmpIstrM) that can be used to speed up text processing code (including strcmp, memcmp, strstr, and strspn functions).

Intel had published the description for new instruction formats, but no sample code nor high-level guidelines. This article tries to provide them.

MovDqU    xmm0, dqword[str1]
PcmpIstrI xmm0, dqword[str2], imm

PcmpIstrI is one of the new string-handling instructions comparing their operands. The first operand is always an SSE register (typically loaded with MovDqU). The second operand can be a memory location. Note the immediate operand, which consists of several bit fields controlling operation modes.

Aggregation operations

The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).

Equal each (imm[3:2] = 10). This operation compares two strings (think of strcmp or memcmp). The result of comparison is a bit mask (1 if the corresponding bytes are equal, 0 if not equal). For example:

operand1 = "UseFlatAssembler"
operand2 = "UsingAnAssembler"
IntRes1  =  1100000111111111

Equal any (imm[3:2] = 00). The first operand is a character set, the second is a string (think of strspn or strcspn). The bit mask includes 1 if the character belongs to a set, 0 if not:

operand2 = "You Drive Me Mad", operand1 = "aeiouy"
IntRes1  =  0110001010010010

Ranges (imm[3:2] = 01). The first operand consists of ranges, for example, "azAZ" means "all characters from a to z and all characters from A to Z":

operand2 = "I'm here because", operand1 = "azAZ"
IntRes1  =  1010111101111111

Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:

operand2 = "WhenWeWillBeWed!", operand1 = "We"
IntRes1  =  000010000000100

After computing the aggregation function, IntRes1 can be complemented, expanded into byte mask or shrinked into index. The result is written into xmm0 or ECX registers. Intel manual explains these details well, so there is no need to repeate them here.

Other features of SSE 4.2

Implementation

The following strcmp and strlen functions were written and published when the processors with SSE 4.2 support were not available yet. Later they were tested on real hardware and found to be correct.

; compile with FASM
; Immediate byte constants
EQUAL_ANY	    = 0000b
RANGES		    = 0100b
EQUAL_EACH	    = 1000b
EQUAL_ORDERED	    = 1100b
NEGATIVE_POLARITY = 010000b
BYTE_MASK	 = 1000000b

; ==== strcmp ====

strcmp_sse42:
  ; Using __fastcall convention, ecx = string1, edx = string2
  mov eax, ecx
  sub eax, edx ; eax = ecx - edx
  sub edx, 16

STRCMP_LOOP:
    add edx, 16
    MovDqU    xmm0, dqword[edx]
    ; find the first *different* bytes, hence negative polarity
    PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH + NEGATIVE_POLARITY
    ja STRCMP_LOOP

  jc STRCMP_DIFF

  ; the strings are equal
  xor eax, eax
  ret
STRCMP_DIFF:
  ; subtract the first different bytes
  add eax, edx
  movzx eax, byte[eax + ecx]
  movzx edx, byte[edx + ecx]
  sub eax, edx
  ret


; ==== strlen ====
strlen_sse42:
  ; ecx = string
  mov eax, -16
  mov edx, ecx
  pxor xmm0, xmm0

STRLEN_LOOP:
    add eax, 16
    PcmpIstrI xmm0, dqword[edx + eax], EQUAL_EACH
    jnz STRLEN_LOOP

  add eax, ecx
  ret

; ==== strstr ====
strstr_sse42:
  ; ecx = haystack, edx = needle

  push esi
  push edi
  MovDqU xmm2, dqword[edx] ; load the first 16 bytes of neddle
  Pxor xmm3, xmm3
  lea eax, [ecx - 16]

  ; find the first possible match of 16-byte fragment in haystack
STRSTR_MAIN_LOOP:
    add eax, 16
    PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED
    ja STRSTR_MAIN_LOOP

  jnc STRSTR_NOT_FOUND

  add eax, ecx ; save the possible match start
  mov edi, edx
  mov esi, eax
  sub edi, esi
  sub esi, 16

  ; compare the strings
@@:
    add esi, 16
    MovDqU    xmm1, dqword[esi + edi]
    ; mask out invalid bytes in the haystack
    PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK
    MovDqU xmm4, dqword[esi]
    PAnd xmm4, xmm0
    PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
    ja @B

  jnc STRSTR_FOUND

  ; continue searching from the next byte
  sub eax, 15
  jmp STRSTR_MAIN_LOOP

STRSTR_NOT_FOUND:
  xor eax, eax

STRSTR_FOUND:
  pop edi
  pop esi
  ret

The first processor with SSE 4.2 support is Intel Core i7, followed by Intel Core i5.

Additional reading

Typo in Intel manual: on figure 5-1, "imm8[6:5]" near Optional boolean negation should be "imm8[5:4]".

8 comments

Splitice, 3 years ago
How would I go about making these inline asm functions in MSVC?
Peter Kankowski, 3 years ago
As I already said, the processors with SSE 4.2 were not officially released yet. The compilers and assemblers only start to implement SSE4 support.

It will not work now. You can read manuals, learn, and wait until you will be able to use your knowledge.
Splitice, 3 years ago
lol thats pretty useless.
dap, 3 years ago
The link to Intel's documentation is dead, I think it should be this one : http://softwarecommunity.intel.com/userfiles/en-us/d9156103.pdf
Peter Kankowski, 3 years ago
Thank you, Dap, I updated the link. Intel changes the location of the manuals so often...
redstone64, 3 years ago
Intel provides c library functions with the hardware which is not a good thing.
We buy these type processor for general usage, not for specific. If this trend continues we may see lots c library functions in the processor core. That makes processor pretty useless !
Adam Messinger, 2 years ago
In your equal any example, will the first bit really be 1? The Y in "You Drive Me Mad" is uppercase whereas the y in "aeiouy" is lowercase. I don't see anything in the documentation about case-insensitive matching.
Peter Kankowski, 2 years ago
Thanks, it was an error. Mea culpa.
Your name:
Comment: