Benchmarking CRC32 and PopCnt instructions

Recent Intel processors (Core i7/i5) can calculate CRC-32C and bit count in hardware. With other domain-specific instructions (string handling, AES encryption) in SSE 4.2, it may seem that low-level optimization for these algorithms will soon become irrelevant, because the hardware acceleration is always faster. But do the new instructions cover all usage cases?

CRC32 instruction

CRC (Cyclic Redundancy Check) is a remainder from dividing your message by a polynomial. Most popular file formats and protocols (Ethernet, MPEG-2, ZIP, RAR, 7-Zip, GZip, and PNG) use the polynomial 0x04C11DB7, while Intel's hardware implementation is based on another polynomial, 0x1EDC6F41 (used in iSCSI and Btrfs). Newly designed protocols and formats can choose the second polynomial to benefit from hardware acceleration, but CRC-32 with polynomial 0x04C11DB7 has to be calculated in software. The CRC32 instruction is not supported by AMD processors.

Most programmers use Dilip Sarwate's table-lookup code for calculating CRC. Intel researchers Michael Kounavis and Frank Berry devised a fast slicing algorithm with unrolled loop and precalculated shifted CRC values (thanks to Andrew M. for pointing this out). The optimized version cannot beat hardware implementation, but it's useful if you have to use the polynomial 0x04C11DB7 or if your processor has no support for the CRC32 instruction.

Another unrolled implementation by Richard Black is included for comparison. It's used in Linux kernel and Info-ZIP.

File size, bytes	250	1025	4103	16384	65537	262151	1048576
Dilip Sarwate	1529	6049	24003	95643	382383	1529357	6231023
Richard Black	1629	6472	25709	102469	409695	1638652	6603394
Slicing-by-4	823	3146	12326	48949	195577	781806	3133560
Slicing-by-8	529	1820	6946	27651	111392	445292	1788348
Hardware	251	732	2660	10329	41057	163957	655555
Hardware, unrolled	263	760	2663	10355	41080	163963	655574

The benchmarks were run on Core i5-750 processor; time is reported in clock cycles. File size is not always a power of 2 to avoid bias in favor of unrolled implementations.

Unrolling does not provide any benefit because of dependency chain between CRC32 instructions. 64-bit version of the instruction should be faster (not tested yet).

PopCnt instruction

Calculating the number of bits set (population count) is useful for implementing sparse arrays or finding Hamming distance. IBM Stretch, CDC 6000, SPARC V9, and Alpha 21264A processors have an instruction for this purpose. Previously, there were no such instruction in x86 ISA, but now it's implemented by Intel and AMD.

AMD also added support for LZCNT instruction, which can be replaced with BSR and CMOVZ. Intel does not implement this instruction.

File size, bytes	250	1025	4103	16384	65537	262151	1048576
Bit hacks	569	2080	8055	31909	126766	507126	2036128
Table	474	1723	6565	25940	103983	415989	1670457
SSE2	212	566	1783	6749	26797	107134	433857
Hardware	206	683	2443	9426	37466	149263	606648
Hardware by Subbu N	215	546	1829	6946	30063	119349	484509
Hardware, unrolled	206	451	1226	4572	17523	69940	286254
SSSE3	169	423	1195	4266	16931	68083	277169

The branchless software implementation uses clever bit hacks to avoid branching: the adjacent bits are summed up in a tree-like pattern. Another function just looks up a 256-element table with number of bits for each possible byte.

SSE2 implementation is based on algorithm from the book by Henry S. Warren, Jr. The formulas were slightly changed to adapt for SSE2 instruction set. Bit count is accumulated in a byte vector. When it can overflow a byte variable, the bytes are horizontally summed up into a DWORD vector. In the original version of the program, PUNPCK* instructions were used for that. Wojciech Muła found a faster way with PSADBW instruction (packed sum of absolute differences between bytes). Finally, the results are summed up horizontally again into a single 32-bit register.

SSSE3 implementation is devised by Wojciech Muła. Each byte is separated into two nibbles (4 bits). The PSHUFB instruction looks up a 16-element table located in a XMM register (similar to the table method). The resulting bytes are summed up horizontally. The only disadvantage of the method: SSSE3 (PSHUFB instruction) is not supported on AMD processors.

Data alignment speeds up the code by 8% and makes it more portable (thanks to Dmitry for the idea!) To avoid the alignment loop for POPCNT, you can read the first partial word and mask out invalid bytes. Agner Fog uses this trick in his strlen implementation. A similar code reads the last partial word without unnecessary branches. SSE2 and SSSE3 functions also benefit from this optimization.

Conclusion

The hardware-accelerated CRC32 and population count are useful additions to x86 ISA. They are 2-3 times faster than smartly optimized calculations in software. However, old optimization tricks are still relevant if you have to use another polynomial for CRC. In this case, you can use "Slicing-by-8" algorithm.

For population count, PSHUFB (SSSE3) is 3% faster than the specialized POPCNT instruction. You can use POPCNT (unrolled hardware implementation) if you need to target AMD processors.

Download the code (MSVC, 30 KB)

44 comments

Dmitry Kostjuchenko, 14 years ago

Piotr, it is indeed an excellent and very useful generalisation of CRC32 algorithms! SSE2 is a winner for a generic-level CPUs, not speaking about CRC32 intrinsic.

While reviewing implementations I would like to add some remarks for those who might consider using Slicing-By-8 algoritm (as a winner within non-SIMD powered algorithms) on platforms with srict data alignment requirements, for example ARM. All algorithms (except Dilip Sarwate) read a byte stream into a 32-bit integer through pointer casting. If bytestream is not aligned to 4-bytes your application will trigger hardware exception.

To circumvent this problem algorithm can be sligtly modified to have alignment requirements fulfilled:

RES CRC_SlicingBy8_Portable(const BYTE* buf, SIZE_T len) {
    RES crc = CRCINIT;
    while (((((intptr_t)buf) & (sizeof(DWORD)-1)) != 0) && (len != 0))
    {
        crc = g_crc_slicing[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
        -- len;
    }
    SIZE_T nqwords = len / (sizeof(DWORD) + sizeof(DWORD));
    for (; nqwords; nqwords--) {
        crc ^= *(DWORD*)buf;
        buf += sizeof(DWORD);
        UINT next = *(DWORD*)buf;
        buf += sizeof(DWORD);
        crc =
            g_crc_slicing[7][(crc      ) & 0xFF] ^
            g_crc_slicing[6][(crc >>  8) & 0xFF] ^
            g_crc_slicing[5][(crc >> 16) & 0xFF] ^
            g_crc_slicing[4][(crc >> 24)] ^
            g_crc_slicing[3][(next      ) & 0xFF] ^
            g_crc_slicing[2][(next >>  8) & 0xFF] ^
            g_crc_slicing[1][(next >> 16) & 0xFF] ^
            g_crc_slicing[0][(next >> 24)];
    }
    len &= sizeof(DWORD) * 2 - 1;
    for (; len; len--)
        crc = g_crc_slicing[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}

Here you can notice while (((((intptr_t)buf) & (sizeof(DWORD)-1)) != 0) && (len != 0)) aligning loop. Such alignment does not affect performance and provides robust operations on any platform.

I made test results, and non-surprisingly 64-bit version of Slicing-by-8 which aligns data to 4-bytes wins from its unaligned parent by around 8%!

-------------
x86 (32-bit):
-------------
Slicing-by-8          511 [caec5f3b] 1965 [400a505f] 7797 [70fc7e1b] 30891 [de08a668] 127362 [13737977] 553983 [6252b8a2] 2327267 [c2548ec1]
Slicing-by-8-Portable 510 [caec5f3b] 1969 [400a505f] 7801 [70fc7e1b] 30895 [de08a668] 127362 [13737977] 553970 [6252b8a2] 2326302 [c2548ec1]

-------------
x64 (64-bit):
-------------
Slicing-by-8          538 [caec5f3b] 2084 [400a505f] 8270 [70fc7e1b] 32799 [de08a668] 134208 [13737977] 572482 [6252b8a2] 2398499 [c2548ec1]
Slicing-by-8-Portable 497 [caec5f3b] 1910 [400a505f] 7582 [70fc7e1b] 30033 [de08a668] 124244 [13737977] 545947 [6252b8a2] 2284239 [c2548ec1]

For those who got interested why alignment is that important for a performance there is a short but useful post about data structures packing and alignment: http://blogs.msdn.com/b/ce_base/archive/2008/06/27/when-not-to-pack-structures.aspx

Peter Kankowski, 14 years ago

Thank you very much! You can avoid the second loop condition by calculating the minimum of required alignment and length:

RES CRC_SlicingBy8(const BYTE* buf, SIZE_T len) {
    RES crc = CRCINIT;

    // Align to DWORD boundary
    SIZE_T align = (sizeof(DWORD) - (INT_PTR)buf) & (sizeof(DWORD) - 1);
    align = Min(align, len);
    len -= align;
    for (; align; align--)
        crc = g_crc_slicing[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);

    SIZE_T nqwords = len / (sizeof(DWORD) + sizeof(DWORD));
    for (; nqwords; nqwords--) {
        ...
    }

    len &= sizeof(DWORD) * 2 - 1;
    for (; len; len--)
        crc = g_crc_slicing[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}

I will download a 64-bit compiler and do more tests with alignment in the next few days. Thanks again.

ace, 14 years ago

I was always interested in the effects that happen when the data isn't inside of the CPU cache -- imagine having two additional runs for data that are e.g. maxcache*2 and maxcache*4 big? What would be the results then?

           === CRC32 ===
                       16 MB        32 MB
Dilip Sarwate      103655366    208210920
Richard Black      110219349    222228697
Slicing-by-4        53124346    106272637
Slicing-by-8        30486717     61106160

Hardware            11246432     22913300
HardwareUnrl        11217245     22625766


           === POPCNT ===
                       16 MB        32 MB
Bit Hacks           34829000     69642759
Table               28520317     57367191
SSE                  7799932     15888929

Hardware            10303028     21050460
HardwareUnrl         5463780     11708857

Marat Dukhan, 14 years ago

SSSE3 instruction set includes a very powerful instruction PSHUFB. It actually performs a 16-entry parallel table lookup. However, it is possible to use this instruction for 256-entry table lookup as well (at the cost of 16 calls of this instruction). Core2/45nm can execute this instruction every clock cycle with 1-cycle latency, and Nehalem can execute even two such instructions per clock cycle. So, this instruction could be used to speed up both CRC calculation and popcnt. I have found code for POPCNT at http://wm.ite.pl/articles/sse-popcount.html

Marat, thank you very much! I did some preliminary benchmarks, and PSHUFB implementation was as fast as SSE 4.2 POPCNT! The only problem: SSSE3 (PSHUFB instruction) is not supported on AMD processors. I will ask permission from the author of the article and will publish the updated benchmarks in the next few days.

JC Meyrignac, 14 years ago

Just a small bug in the source code:

for (SIZE_T i = 0; i < NUM_OF(crc_funcs); i++)
TestPopCntFunc(popcnt_funcs[i]);

should be:
for (SIZE_T i = 0; i < NUM_OF(popcnt_funcs); i++)
TestPopCntFunc(popcnt_funcs[i]);

Thank you very much. I corrected it.

Also, the lines:
TestPopCntFunc(POPCNT_Hardware);
TestPopCntFunc(POPCNT_HardwareUnrolled);
create an exception on a processor not having hardware popcnt.

I'm searching for the fastest possible 64 bits popcount.
Here is one routine when no hardware popcnt is available:
int POPCNT64(UINT64 v)
{
unsigned int v1, v2;
v1 = (unsigned int) (v & 0xFFFFFFFF);
v1 -= (v1 >> 1) & 0x55555555;
v1 = (v1 & 0x33333333) + ((v1 >> 2) & 0x33333333);
v1 = (v1 + (v1 >> 4)) & 0x0F0F0F0F;
v2 = (unsigned int) (v >> 32);
v2 -= (v2 >> 1) & 0x55555555;
v2 = (v2 & 0x33333333) + ((v2 >> 2) & 0x33333333);
v2 = (v2 + (v2 >> 4)) & 0x0F0F0F0F;
return ((v1 * 0x01010101) >> 24) + ((v2 * 0x01010101) >> 24);
}

Is it possible to do better ?
(not much information on this particular function on the web)

JC, yes you can probably do better. It depends on your needs, on which processor do you want it to run? Maybe you need 64-bit count on 32-bit processor? Then without SSE, depending on the CPU implementation, the "Table" can be faster. And on 64-bit processor without SSE you can use all 64-bits of registers, therefore the constants should be 64-bit wide, and that's then twice as fast as the code you quote. The fastest implementation depends on the properties of the target CPU and the mode in which you want the code to run.

Ace, you are totally right, I forgot this one, since I'm working on a 32 bits system.

Anyway, I found some solutions.
If somebody is interested, I can post code here...

Certainly, it will be interesting for many people. You are welcomed to post it here.

Don Higgins, 14 years ago

There is a new POPCNT instruction coming on the new IBM zEnterprise z196 mainframe processor. As a result I added new Mainframe Assembler Coding Contest Problem #22 to code mainframe POPCNT instruction using currently available z9/z10 instructions.

You can find 4 solutions with source code and timing results posted here:

http://www.z390.org/z390_Mainframe_Assemble_Coding_Contest.htm

Don Higgins
don@higgins.net

I'm late, but here are the routines:
http://multi-up.com/318633

in C++ and optimized ASM.

Martin B, 14 years ago

I think the SSSE3 version of PopCnt could be made even faster by removing the initial 4-bit masking. PSHUFB already handles that and ignores the upper 4-bits, so you could remove the two initial lo/hi AND's.

Unfortunately, when the most significant bit is set, PSHUFB writes 0 in the result byte (see Intel's description of the instruction). So you have to mask out the higher bits anyway.

Yaniv, 13 years ago

1. This is the latests algorithm used in linux kernel
http://lxr.linux.no/linux+v2.6.36/lib/crc32.c
Where does it stands in the benchmark table (is it "Richard Black"?)

2. We are using Java latest J2SE JDK which uses CRC32 for DEFALTE (on 64Bit machine)
Anyone knows which algorithm Sun uses?

Thanks

Subbu N, 13 years ago

Hi,

your implementation of POPCNT_Hardware() is sensitive to data alignment issue and also has performance issue. The modified one below rectifies both the issue.

static RES POPCNT_Hardware(const BYTE * buf, SIZE_T len) {
	RES cnt = 0;
	
	SIZE_T align = sizeof(DWORD) - (INT_PTR)buf & (sizeof(DWORD) - 1);
	align = (len < align) ? 0 : align;
	if (align) {
		buf = (BYTE *)( (INT_PTR)buf & (-(INT_PTR)sizeof(DWORD)) );
		DWORD first = *(DWORD*)buf;
		first &= ~(0xFFFFFFFF >> (align * CHAR_BIT));
		cnt = __popcnt(first);
		buf += sizeof(DWORD);
		len -= align;
	}


	SIZE_T ndwords = len / sizeof(DWORD);

	DWORD* pInt = (DWORD*)buf;
	while(ndwords)
		cnt += __popcnt(pInt[--ndwords]);   // performance optimization: Removed ptr addition.

	buf += len / sizeof(DWORD) * sizeof(DWORD);


	if (len & sizeof(WORD)) {
		cnt += __popcnt(*(WORD*)buf);
		buf += sizeof(WORD);
	}
	if (len & sizeof(BYTE))
		cnt += __popcnt(*buf);
	return cnt;
}

Peter Kankowski, 13 years ago

1. Yaniv, I already mentioned in the article that "Richard Black" is used in Linux kernel. 2. If the Java virtual machine is not open source, you have no practical way to find out.

Subbu N, thank you very much! I added your function to the benchmark. Surprisingly, reversed memory access (from the end to the beginning of array) is fast! Usually, it's much slower, so you have to use a special trick, counting from "minus length" to zero:

INT_PTR ndwords = len / sizeof(DWORD);

DWORD* pInt = (DWORD*)buf + ndwords;
ndwords = -ndwords;
while(ndwords)
    cnt += __popcnt(pInt[ndwords++]);

But on my Core i5, there is no difference between this code and yours! Probably, they optimized reversed memory access in the recent processors.

none, 13 years ago

Why on earth did they limit this instruction to one specific polynomial instead of loading it from a a register or something?

A CRC instruction with arbitrary polynomial would be slower. The classic hardware implementation is a feedback shift register with hardwired polynomial (bit-by-bit calculation). Intel most likely uses a lookup table in ROM (lower latency; does not need 32 iterations to get the result).

But I wonder why they choose the iSCSI polynomial, which is rarer than 0x04C11DB7.

Marat Dukhan, 13 years ago

Here are the results from Sandy Bridge. I have compiled x86 and x64 versions, both with and without AVX

Compiler: Microsoft C/C++ Optimizing Compiler Version 16.00.30319.01

x86 results

                               250               1025               4103              16384              65537             262151            1048576
Dilip Sarwate      1652 [f0c40ca9]    6569 [7b30d248]   26738 [f2a67bc2]  108271 [ab540a8f]  409052 [1e3904dc] 1644600 [500bdca5] 6055046 [8b91984d]
Richard Black      1484 [f0c40ca9]    6092 [7b30d248]   24356 [f2a67bc2]   97302 [ab540a8f]  389332 [1e3904dc] 1557346 [500bdca5] 6229774 [8b91984d]
Slicing-by-4        751 [f0c40ca9]    2981 [7b30d248]   11867 [f2a67bc2]   47283 [ab540a8f]  189350 [1e3904dc]  757532 [500bdca5] 2815542 [8b91984d]
Slicing-by-8        398 [f0c40ca9]    1600 [7b30d248]    6348 [f2a67bc2]   25250 [ab540a8f]  101010 [1e3904dc]  404052 [500bdca5] 1618448 [8b91984d]

Hardware            184 [f0c40ca9]     630 [7b30d248]    2406 [f2a67bc2]    9492 [ab540a8f]   37853 [1e3904dc]  151304 [500bdca5]  605043 [8b91984d]
HardwareUnrl        107 [f0c40ca9]     625 [7b30d248]    2403 [f2a67bc2]    9489 [ab540a8f]   37858 [1e3904dc]  151311 [500bdca5]  605037 [8b91984d]


                               250               1025               4103              16384              65537             262151            1048576
Bit Hacks           406 [     3d5]    1597 [    100a]    6317 [    4012]   25221 [    ff15]  100960 [   3fc82]  403982 [   ffe8a] 1620071 [  40032a]
Table               344 [     3d5]    1357 [    100a]    5289 [    4012]   21089 [    ff15]   84615 [   3fc82]  338883 [   ffe8a] 1261582 [  40032a]
SSE2                100 [     3d5]     340 [    100a]    1292 [    4012]    5064 [    ff15]   20274 [   3fc82]   81140 [   ffe8a]  328158 [  40032a]
SSE2-AVX             86 [     3d5]     278 [    100a]    1060 [    4012]    4138 [    ff15]   16506 [   3fc82]   66008 [   ffe8a]  267506 [  40032a]

Hardware            152 [     3d5]     590 [    100a]    2240 [    4012]    8880 [    ff15]   35462 [   3fc82]  141778 [   ffe8a]  567476 [  40032a]
HardwareSubbuN      162 [     3d5]     578 [    100a]    2224 [    4012]    8806 [    ff15]   35140 [   3fc82]  140464 [   ffe8a]  561788 [  40032a]
HardwareUnrl         90 [     3d5]     428 [    100a]    1668 [    4012]    6600 [    ff15]   26360 [   3fc82]  105368 [   ffe8a]  421494 [  40032a]
SSSE3                78 [     3d5]     230 [    100a]     884 [    4012]    3420 [    ff15]   13640 [   3fc82]   54556 [   ffe8a]  221278 [  40032a]
SSSE3-AVX            66 [     3d5]     200 [    100a]     762 [    4012]    2944 [    ff15]   11878 [   3fc82]   47478 [   ffe8a]  193446 [  40032a]

x64 results

                               250               1025               4103              16384              65537             262151            1048576
Dilip Sarwate      1748 [f0c40ca9]    7117 [7b30d248]   28431 [f2a67bc2]  113477 [ab540a8f]  454003 [1e3904dc] 1816049 [500bdca5] 6746044 [8b91984d]
Richard Black      1330 [f0c40ca9]    5490 [7b30d248]   21940 [f2a67bc2]   87760 [ab540a8f]  351228 [1e3904dc] 1404920 [500bdca5] 5619578 [8b91984d]
Slicing-by-4        720 [f0c40ca9]    2852 [7b30d248]   11265 [f2a67bc2]   45012 [ab540a8f]  180301 [1e3904dc]  721551 [500bdca5] 2680570 [8b91984d]
Slicing-by-8        421 [f0c40ca9]    1659 [7b30d248]    6634 [f2a67bc2]   26353 [ab540a8f]  105649 [1e3904dc]  423040 [500bdca5] 1572920 [8b91984d]

Hardware             18 [f0c40ca9]      20 [7b30d248]      18 [f2a67bc2]      20 [ab540a8f]      20 [1e3904dc]      20 [500bdca5]      20 [8b91984d]
HardwareUnrl         20 [f0c40ca9]      20 [7b30d248]      20 [f2a67bc2]      20 [ab540a8f]      20 [1e3904dc]      20 [500bdca5]      20 [8b91984d]


                               250               1025               4103              16384              65537             262151            1048576
Bit Hacks           348 [     3d5]    1352 [    100a]    5348 [    4012]   21336 [    ff15]   85350 [   3fc82]  341148 [   ffe8a] 1368242 [  40032a]
Table               292 [     3d5]    1112 [    100a]    4402 [    4012]   17554 [    ff15]   70302 [   3fc82]  281560 [   ffe8a] 1127224 [  40032a]
SSE2                 86 [     3d5]     317 [    100a]    1203 [    4012]    4723 [    ff15]   18871 [   3fc82]   69369 [   ffe8a]  283502 [  40032a]
SSE2-AVX             83 [     3d5]     317 [    100a]    1200 [    4012]    4723 [    ff15]   18871 [   3fc82]   69341 [   ffe8a]  282360 [  40032a]

Hardware             20 [     3d5]      20 [    100a]      18 [    4012]      20 [    ff15]      20 [   3fc82]      20 [   ffe8a]      20 [  40032a]
HardwareSubbuN      160 [     3d5]     570 [    100a]    2222 [    4012]    8802 [    ff15]   35132 [   3fc82]  140462 [   ffe8a]  561780 [  40032a]
HardwareUnrl         88 [     3d5]     428 [    100a]    1668 [    4012]    6600 [    ff15]   26352 [   3fc82]  105352 [   ffe8a]  421366 [  40032a]
SSSE3                58 [     3d5]     184 [    100a]     704 [    4012]    2706 [    ff15]   10936 [   3fc82]   43788 [   ffe8a]  178860 [  40032a]
SSSE3-AVX            62 [     3d5]     200 [    100a]     763 [    4012]    2914 [    ff15]   11775 [   3fc82]   47148 [   ffe8a]  192455 [  40032a]

I don't know the reason behind the exceptionally high performance of hardware POPCNT/CRC32 in x64 version, but I've checked that the compiler didn't throw up the entire function.

Thank you very much for reporting this.

On x64, inline assembly is not available. If you replace the inline assembly block with intrinsic __rdtsc:

UINT64 inline GetRDTSC() {
   int a[4];
   __cpuid(a, 0);
   return __rdtsc();
}

then the compiler will inline BenchmarkFunction and stupidly reorder __rdtsc calls:

 00f96 e8 00 00 00 00  call   ?CRC_Hardware@@YAIPEBE_K@Z ; CRC_Hardware
...
 00fc6 0f 31           rdtsc
 00fc8 48 c1 e2 20     shl  rdx, 32   ; 00000020H
 00fcc 48 0b c2        or  rax, rdx
 00fcf 4c 8b c0        mov r8, rax
...
 00fe8 0f 31           rdtsc
 00fea 48 c1 e2 20     shl  rdx, 32   ; 00000020H
 00fee 48 0b c2        or  rax, rdx
 00ff1 41 2b c0        sub  eax, r8d

So it no longer measures the runtime of CRC_Hardware, because the first rdtsc is moved down.

I tried to make GetRDTSC volatile, but it does not help. In this program, you can forbid to inline the function using /Ob1 (in IDE: Optimization → Inline Function Expansion → Only __inline). After that, you will get realistic results:

File size, bytes	250	1025	4103	16384	65537	262151	1048576
Dilip Sarwate	2208	8865	30840	122946	491595	1966240	7891006
Richard Black	1435	5632	22306	88826	355086	1420149	5773114
Slicing-by-4	786	2932	11489	44597	182257	733423	2939149
Slicing-by-8	546	1846	7005	27749	112340	449503	1800263
Hardware	255	740	2666	10335	41063	163969	655517
Hardware, unrolled	263	763	2666	10360	41086	163977	655540

File size, bytes	250	1025	4103	16384	65537	262151	1048576
Bit hacks	586	2132	8163	32269	128960	515403	2065363
Table	469	1720	6491	25588	101040	403306	1628757
SSE2	697	2406	9074	35663	142989	572963	2290717
Hardware	209	666	2383	9086	36655	146783	595880
Hardware by Subbu N	217	569	1954	7483	29960	119309	484340
Hardware, unrolled	195	414	1000	3580	15000	62680	260795
SSSE3	285	774	2489	8069	32017	127612	513063

Tested on Intel Core i5 750 in 64-bit mode.

ace, 13 years ago

See also _ReadWriteBarrier (http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.90).aspx) which prevents the compiler reordering of reads and writes to global variables.

Yes, that was likely the case. I recompiled with

UINT64 __declspec(noinline) GetRDTSC() {
	int registers[4];
	__cpuid( registers, 0 );
	return __rdtsc();
}

and I also disabled buffer security check.

x86 results:

                               250               1025               4103              16384              65537             262151            1048576
Dilip Sarwate      1452 [f0c40ca9]    5940 [7b30d248]   22890 [f2a67bc2]   91416 [ab540a8f]  365688 [1e3904dc] 1463802 [500bdca5] 5918320 [8b91984d]
Richard Black      1600 [f0c40ca9]    6560 [7b30d248]   26228 [f2a67bc2]  104806 [ab540a8f]  376094 [1e3904dc] 1503640 [500bdca5] 6056524 [8b91984d]
Slicing-by-4        670 [f0c40ca9]    2660 [7b30d248]   10610 [f2a67bc2]   42424 [ab540a8f]  169608 [1e3904dc]  679368 [500bdca5] 2756370 [8b91984d]
Slicing-by-8        396 [f0c40ca9]    1538 [7b30d248]    6118 [f2a67bc2]   24376 [ab540a8f]   97536 [1e3904dc]  390178 [500bdca5] 1562846 [8b91984d]

Hardware            162 [f0c40ca9]     564 [7b30d248]    2158 [f2a67bc2]    8502 [ab540a8f]   33934 [1e3904dc]  135662 [500bdca5]  542474 [8b91984d]
HardwareUnrl         96 [f0c40ca9]     560 [7b30d248]    2158 [f2a67bc2]    8498 [ab540a8f]   33940 [1e3904dc]  135668 [500bdca5]  542472 [8b91984d]


                               250               1025               4103              16384              65537             262151            1048576
Bit Hacks           382 [     3d5]    1438 [    100a]    5654 [    4012]   22630 [    ff15]   90490 [   3fc82]  362272 [   ffe8a] 1452788 [  40032a]
Table               302 [     3d5]    1202 [    100a]    4740 [    4012]   18950 [    ff15]   75948 [   3fc82]  303740 [   ffe8a] 1217414 [  40032a]
SSE2                 94 [     3d5]     328 [    100a]    1324 [    4012]    4898 [    ff15]   19600 [   3fc82]   78348 [   ffe8a]  316984 [  40032a]
SSE2-AVX             84 [     3d5]     264 [    100a]    1012 [    4012]    3974 [    ff15]   15936 [   3fc82]   63700 [   ffe8a]  258316 [  40032a]

Hardware            134 [     3d5]     568 [    100a]    2162 [    4012]    8570 [    ff15]   34236 [   3fc82]  137000 [   ffe8a]  567548 [  40032a]
HardwareSubbuN      178 [     3d5]     624 [    100a]    2400 [    4012]    9486 [    ff15]   37843 [   3fc82]  151271 [   ffe8a]  604996 [  40032a]
HardwareUnrl        107 [     3d5]     464 [    100a]    1803 [    4012]    7110 [    ff15]   26358 [   3fc82]  105364 [   ffe8a]  421378 [  40032a]
SSSE3                83 [     3d5]     255 [    100a]     985 [    4012]    3686 [    ff15]   14698 [   3fc82]   52668 [   ffe8a]  213448 [  40032a]
SSSE3-AVX            48 [     3d5]     188 [    100a]     728 [    4012]    2830 [    ff15]   11450 [   3fc82]   45916 [   ffe8a]  187116 [  40032a]

x64 results:

                               250               1025               4103              16384              65537             262151            1048576
Dilip Sarwate      1821 [f0c40ca9]    7193 [7b30d248]   28517 [f2a67bc2]  113566 [ab540a8f]  407226 [1e3904dc] 1629914 [500bdca5] 6606056 [8b91984d]
Richard Black      1422 [f0c40ca9]    5578 [7b30d248]   22060 [f2a67bc2]   87832 [ab540a8f]  339288 [1e3904dc] 1356410 [500bdca5] 5429566 [8b91984d]
Slicing-by-4        742 [f0c40ca9]    2724 [7b30d248]   10562 [f2a67bc2]   41942 [ab540a8f]  167510 [1e3904dc]  646902 [500bdca5] 2626942 [8b91984d]
Slicing-by-8        478 [f0c40ca9]    1570 [7b30d248]    6026 [f2a67bc2]   23764 [ab540a8f]   94802 [1e3904dc]  379236 [500bdca5] 1519496 [8b91984d]

Hardware            212 [f0c40ca9]     612 [7b30d248]    2208 [f2a67bc2]    8570 [ab540a8f]   34002 [1e3904dc]  135730 [500bdca5]  561894 [8b91984d]
HardwareUnrl        348 [f0c40ca9]     944 [7b30d248]    3228 [f2a67bc2]   12460 [ab540a8f]   49348 [1e3904dc]  196788 [500bdca5]  542540 [8b91984d]


                               250               1025               4103              16384              65537             262151            1048576
Bit Hacks           408 [     3d5]    1374 [    100a]    5224 [    4012]   20642 [    ff15]   82384 [   3fc82]  330324 [   ffe8a] 1321722 [  40032a]
Table               350 [     3d5]    1150 [    100a]    4322 [    4012]   17012 [    ff15]   68016 [   3fc82]  281618 [   ffe8a] 1089932 [  40032a]
SSE2                174 [     3d5]     400 [    100a]    1276 [    4012]    4750 [    ff15]   18764 [   3fc82]   74916 [   ffe8a]  302132 [  40032a]
SSE2-AVX            162 [     3d5]     352 [    100a]    1150 [    4012]    4294 [    ff15]   15626 [   3fc82]   62314 [   ffe8a]  252708 [  40032a]

Hardware            198 [     3d5]     598 [    100a]    2212 [    4012]    8644 [    ff15]   34336 [   3fc82]  137122 [   ffe8a]  548438 [  40032a]
HardwareSubbuN      200 [     3d5]     600 [    100a]    2194 [    4012]    8560 [    ff15]   33968 [   3fc82]  151335 [   ffe8a]  561848 [  40032a]
HardwareUnrl        188 [     3d5]     502 [    100a]    1732 [    4012]    6670 [    ff15]   25520 [   3fc82]  101792 [   ffe8a]  406906 [  40032a]
SSSE3               134 [     3d5]     324 [    100a]     996 [    4012]    3692 [    ff15]   14566 [   3fc82]   58040 [   ffe8a]  211658 [  40032a]
SSSE3-AVX           124 [     3d5]     256 [    100a]     746 [    4012]    2676 [    ff15]   10604 [   3fc82]   42366 [   ffe8a]  172780 [  40032a]

And also the results with 64-bit POPCNT and CRC32C (all tests on 64-bit code)

CRC32C-32     212 [f0c40ca9]     612 [7b30d248]    2208 [f2a67bc2]    8570 [ab540a8f]   34002 [1e3904dc]  135728 [500bdca5]  605101 [8b91984d]
CRC32C-64     186 [f0c40ca9]     444 [7b30d248]    1534 [f2a67bc2]    5812 [ab540a8f]   23038 [1e3904dc]   91828 [500bdca5]  367046 [8b91984d]
POPCNT-32     198 [     3d5]     598 [    100a]    2206 [    4012]    8644 [    ff15]   34336 [   3fc82]  137106 [   ffe8a]  548404 [  40032a]
POPCNT-64     132 [     3d5]     340 [    100a]    1154 [    4012]    4408 [    ff15]   17388 [   3fc82]   69332 [   ffe8a]  277038 [  40032a]

The code

// Hardware-accelerated CRC-32C (using CRC32 instruction)
static RES CRC_Hardware(const BYTE * buf, SIZE_T len) {
	RES crc = CRCINIT;

	// Align to DWORD boundary
	SIZE_T align = (sizeof(DWORD) - (INT_PTR)buf) & (sizeof(DWORD) - 1);
	align = Min(align, len);
	len -= align;
	for (; align; align--)
		crc = g_crc_slicing[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);

#ifdef CRC32C64
	SIZE_T ndwords = len / sizeof(DWORD64);
	for (; ndwords; ndwords--) {
		crc = _mm_crc32_u64(crc, *(DWORD64*)buf);
		buf += sizeof(DWORD64);
	}
	len &= sizeof(DWORD64) - 1;
#else
	SIZE_T ndwords = len / sizeof(DWORD);
	for (; ndwords; ndwords--) {
		crc = _mm_crc32_u32(crc, *(DWORD*)buf);
		buf += sizeof(DWORD);
	}
	len &= sizeof(DWORD) - 1;
#endif
	for (; len; len--)
		crc = _mm_crc32_u8(crc, *buf++);
	return ~crc;
}


// Hardware-accelerated population count (using POPCNT instruction)
static RES POPCNT_Hardware(const BYTE * buf, SIZE_T len) {
	RES cnt = 0;
#ifdef POPCNT64
	SIZE_T ndwords = len / sizeof(DWORD64);
	for(; ndwords; ndwords--) {
		cnt += __popcnt64(*(DWORD64*)buf);
		buf += sizeof(DWORD64);
	}
	if (len & sizeof(DWORD)) {
		cnt += __popcnt(*(DWORD*)buf);
		buf += sizeof(DWORD);
	}
#else
	SIZE_T ndwords = len / sizeof(DWORD);
	for(; ndwords; ndwords--) {
		cnt += __popcnt(*(DWORD*)buf);
		buf += sizeof(DWORD);
	}
#endif
	if (len & sizeof(WORD)) {
		cnt += __popcnt(*(WORD*)buf);
		buf += sizeof(WORD);
	}
	if (len & sizeof(BYTE))
		cnt += __popcnt(*buf);
	return cnt;
}

Thank you! Both functions are much faster in the 64-bit version; it's also interesting that SSSE3 code using AVX is still faster than the 64-bit POPCNT instruction. I will add 64-bit benchmarks to the article.

Ace, thank you; I will try _ReadWriteBarrier.

Oren D, 13 years ago

Thanks,

Helped me a lot,i thought i optimized my code but i guess you can always do better.

Sirmabus, 13 years ago

Spasiba moi druge

Glad I found your site.

Great contribution to science here.

Was thinking there had to be something better then the plain old table way.

If the table memory access has anything to do with it anyhow.

Can you explain how "Align to DWORD boundary" works?

and what the g_crc_slicing do

// Align to DWORD boundary
	SIZE_T align = (sizeof(DWORD) - (INT_PTR)buf) & (sizeof(DWORD) - 1);
	align = Min(align, len);
	len -= align;
	for (; align; align--)
		crc = g_crc_slicing[0][(crc ^ *buf++) & 0xFF] ^ (crc >> 8);

thanks

The code first calculates the number of bytes to the next DWORD boundary (0..3 bytes). If the length is smaller than that, only len bytes with be processed.

You can read about the mathematics of g_crc_slicing in Intel's paper.

Song, 13 years ago

Too bad there isn't an SSE instruction to do column-wise popcount: given a vector of m128 elements, count the number of bits set in the 0th bit of each element, then count the number of bits set in the 1st bit of each element, etc (128 counters in total, regardless of vector size).

I'm forced to count the old fashioned way, with shifts and masks:

r0 += v[0]    & 0x1111...
r1 += v[0]>>1 & 0x1111...
r2 += v[0]>>2 & 0x1111...
r3 += v[0]>>3 & 0x1111...

// continue with v[1], v[2] etc. 
// r0..3 are 128 bits registers
// every 31 counts, spill to larger counter array in memory, reset r0-3 and continue.

The performance is awful - 11 ops and 4 memory accesses for every element in the vector, plus large overhead every 31 elements - much slower than memory speed...

Here is UNIX/C version of Slicing-by-8 by Keith Bostic (including support for big-endian architectures).

Erik Gorset, 13 years ago

Song: there's a neat trick described at http://steike.com/code/bits/vertical-counter/ that does vertical bit counting.

Wojciech Mula, 13 years ago

Hi, paper Anatomy of High-Performance 2D Similarity Calculations (http://www.cs.stanford.edu/people/ihaque/#publications) contains some real-life comparison of popcount procedures. SSSE3 method is nearly as fast as hardware popcnt instruction.

Wojciech, thank you very much.

Mario, 12 years ago

Very cool article, thanks for gathering the algorithms and explaining the experiment.

One thing to note, is that SSE2/SSE3 popcount techniques work on 16byte chunks, so if you need to work on singe words here and there they are not that convenient.

Peter Kankowski, 12 years ago

Population count algorithms: http://chessprogramming.wikispaces.com/Population+Count

Marcin Zukowski, 11 years ago

Peter, great website and an interesting comparison. For completeness would be nice if you used the 64-bit crc32 version - should provide an extra boost.

Btw, in Song's question/solution, doesn't he need to sum every 15 counts, not 31?

Peter Kankowski, 11 years ago

Marcin, thank you for the kind words. Yes, it should be 15 counts, not 31. I will do the benchmarks for the 64-bit crc32 instruction in future.

Evgeniy, 11 years ago

Peter, CRC32 has a latency of 3 cycles and a throughput of 1 cycle.

Using 64-bit instruction and splitting input buffer into three chunks will make hardware implementation noticeably faster.

Benchmarking CRC32 and PopCnt instructions

CRC32 instruction

PopCnt instruction

Conclusion

About the author

44 comments

Featured pages

Recent comments