Abstract:
Exploring new instructions in Core i7/i5.

Created by Peter Kankowski
Last changed
Contributors: Dmitry Kostjuchenko, Marat Dukhan, and Subbu N
Filed under Low-level code optimization


Benchmarking CRC32 and PopCnt instructions

Recent Intel processors (Core i7/i5) can calculate CRC-32C and bit count in hardware. With other domain-specific instructions (string handling, AES encryption) in SSE 4.2, it may seem that low-level optimization for these algorithms will soon become irrelevant, because hardware acceleration is always faster. But do the new instructions cover all use cases?

CRC32 instruction

CRC (Cyclic Redundancy Check) is the remainder from dividing your message by a polynomial. Most popular file formats and protocols (Ethernet, MPEG-2, ZIP, RAR, 7-Zip, GZip, and PNG) use the polynomial 0x04C11DB7, while Intel's hardware implementation is based on another polynomial, 0x1EDC6F41 (used in iSCSI and Btrfs). Newly designed protocols and formats can choose the second polynomial to benefit from hardware acceleration, but CRC-32 with the polynomial 0x04C11DB7 has to be calculated in software. The CRC32 instruction is not supported by AMD processors.
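For reference, here is a bit-by-bit software model of the checksum that the CRC32 instruction computes (CRC-32C, Castagnoli polynomial 0x1EDC6F41, processed in reflected form as 0x82F63B78). This is an illustrative sketch with a name of my choosing, not the benchmarked code:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit model of CRC-32C: the same checksum the SSE4.2 CRC32
   instruction produces, computed one bit at a time in software. */
static uint32_t crc32c_bitwise(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;           /* standard initial value */
    while (len--) {
        crc ^= *p++;                      /* fold in the next byte  */
        for (int k = 0; k < 8; k++)       /* reflected polynomial   */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return ~crc;                          /* final inversion        */
}
```

The standard check value for CRC-32C is 0xE3069283 for the input "123456789", which makes it easy to validate any faster implementation against this model.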

Most programmers use Dilip Sarwate's table-lookup code for calculating CRC. Intel researchers Michael Kounavis and Frank Berry devised a fast slicing algorithm with an unrolled loop and precalculated shifted CRC values (thanks to Andrew M. for pointing this out). The optimized version cannot beat the hardware implementation, but it's useful if you have to use the polynomial 0x04C11DB7 or if your processor does not support the CRC32 instruction.
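A minimal sketch of the Sarwate table-lookup approach for the common polynomial 0x04C11DB7 (reflected form 0xEDB88320); names are mine, and this is illustrative rather than the benchmarked code:

```c
#include <stddef.h>
#include <stdint.h>

/* Sarwate's byte-at-a-time CRC-32: one table lookup per input byte. */
static uint32_t crc32_sarwate(const void *buf, size_t len)
{
    static uint32_t table[256];
    if (!table[1]) {                       /* build table on first call */
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t c = i;
            for (int k = 0; k < 8; k++)
                c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1)));
            table[i] = c;
        }
    }
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return ~crc;
}
```

Slicing-by-4/8 extends this idea by processing 4 or 8 bytes per iteration with several precalculated tables, which is where its speedup comes from.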

Another unrolled implementation, by Richard Black, is included for comparison. It's used in the Linux kernel and Info-ZIP.

File size, bytes      250     1025     4103    16384    65537   262151   1048576
Dilip Sarwate        1529     6049    24003    95643   382383  1529357   6231023
Richard Black        1629     6472    25709   102469   409695  1638652   6603394
Slicing-by-4          823     3146    12326    48949   195577   781806   3133560
Slicing-by-8          529     1820     6946    27651   111392   445292   1788348
Hardware              251      732     2660    10329    41057   163957    655555
Hardware, unrolled    263      760     2663    10355    41080   163963    655574

The benchmarks were run on a Core i5-750 processor; time is reported in clock cycles. The file sizes are not always powers of 2, to avoid bias in favor of unrolled implementations.

Unrolling provides no benefit because of the dependency chain between CRC32 instructions. The 64-bit version of the instruction should be faster (not tested yet).

PopCnt instruction

Calculating the number of bits set (population count) is useful for implementing sparse arrays or finding the Hamming distance. The IBM Stretch, CDC 6000, SPARC V9, and Alpha 21264A processors have an instruction for this purpose. Previously, there was no such instruction in the x86 ISA, but now it's implemented by both Intel and AMD.

AMD also added support for the LZCNT instruction, which can otherwise be emulated with BSR and CMOVZ. Intel does not implement this instruction.

File size, bytes        250     1025     4103    16384    65537   262151   1048576
Bit hacks               569     2080     8055    31909   126766   507126   2036128
Table                   474     1723     6565    25940   103983   415989   1670457
SSE2                    212      566     1783     6749    26797   107134    433857
Hardware                206      683     2443     9426    37466   149263    606648
Hardware by Subbu N     215      546     1829     6946    30063   119349    484509
Hardware, unrolled      206      451     1226     4572    17523    69940    286254
SSSE3                   169      423     1195     4266    16931    68083    277169

The bit-hacks implementation avoids branching: adjacent bits are summed up in a tree-like pattern. Another function simply looks up a 256-element table with the number of bits for each possible byte value.
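The tree-like summation can be sketched as the classic branchless population count; this is the well-known textbook form, not necessarily line-for-line the benchmarked function:

```c
#include <stdint.h>

/* Branchless popcount: sum adjacent bit pairs, then nibbles, then
   bytes; a final multiply gathers the byte sums into the top byte. */
static int popcount32(uint32_t v)
{
    v -= (v >> 1) & 0x55555555u;                       /* 2-bit sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 4-bit sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (int)((v * 0x01010101u) >> 24);             /* total      */
}
```

Each step doubles the width of the partial sums, so the whole count takes a constant number of operations regardless of how many bits are set.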

The SSE2 implementation is based on an algorithm from the book Hacker's Delight by Henry S. Warren, Jr. The formulas were slightly changed to adapt them to the SSE2 instruction set. The bit count is accumulated in a byte vector. Before it can overflow a byte variable, the bytes are horizontally summed into a DWORD vector. In the original version of the program, PUNPCK* instructions were used for that; Wojciech Muła found a faster way with the PSADBW instruction (packed sum of absolute differences between bytes). Finally, the results are summed horizontally again into a single 32-bit register.
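The PSADBW trick works because the sum of absolute differences against an all-zero register is just the horizontal sum of the byte lanes. A scalar model of that reduction, shown on a 64-bit word instead of a 16-byte XMM register (the function name is mine):

```c
#include <stdint.h>

/* Scalar model of PSADBW against zero: add up the eight byte lanes
   of a 64-bit word. The real code does this on XMM registers that
   hold per-byte bit counts. */
static unsigned hsum_bytes(uint64_t v)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (unsigned)((v >> (8 * i)) & 0xFF);
    return sum;
}
```

In the vector code this collapses a whole byte-count accumulator in one instruction, which is why it beats the PUNPCK* widening sequence.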

The SSSE3 implementation was devised by Wojciech Muła. Each byte is split into two nibbles (4 bits each). The PSHUFB instruction looks up a 16-element table located in an XMM register (similar to the table method). The resulting bytes are then summed horizontally. The only disadvantage of the method is that SSSE3 (the PSHUFB instruction) is not supported on AMD processors.
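A scalar model of the nibble-table idea: each nibble indexes a 16-entry bit-count table. PSHUFB performs sixteen of these lookups at once from a table held in an XMM register; here the same table is applied one byte at a time (names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Bit counts for the values 0..15: the table PSHUFB would hold. */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2,  1, 2, 2, 3,  1, 2, 2, 3,  2, 3, 3, 4
};

/* Scalar model of the PSHUFB popcount: two table lookups per byte. */
static unsigned popcount_nibble(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    unsigned sum = 0;
    while (len--) {
        sum += nibble_bits[*p & 0x0F];   /* low nibble  */
        sum += nibble_bits[*p >> 4];     /* high nibble */
        p++;
    }
    return sum;
}
```

The vector version replaces the per-byte loop with one PSHUFB per 16 bytes, followed by the horizontal summation described above.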

Data alignment speeds up the code by 8% and makes it more portable (thanks to Dmitry for the idea!). To avoid the alignment loop for POPCNT, you can read the first partial word and mask out the invalid bytes; Agner Fog uses this trick in his strlen implementation. Similar code reads the last partial word without unnecessary branches. The SSE2 and SSSE3 functions also benefit from this optimization.
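The masking step of that trick can be sketched as follows, assuming a little-endian machine. Note that reading outside the buffer is harmless at machine level for an aligned load, but it is undefined behavior in portable C, so real code does this in assembly or relies on compiler-specific guarantees; the helper below only models the mask itself:

```c
#include <stdint.h>

/* Zero the first `skip` bytes of a little-endian 64-bit word loaded
   from the aligned address just below the buffer start, where
   skip = (uintptr_t)p % 8 is the number of bytes that lie before the
   buffer and must not be counted. Valid for skip in 0..7. */
static uint64_t mask_leading_bytes(uint64_t word, unsigned skip)
{
    return skip ? word & (~0ull << (8 * skip)) : word;
}
```

After masking, the word can go straight into POPCNT (or the vector pipeline) with no alignment loop and no branches.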

Conclusion

The hardware-accelerated CRC32 and population count are useful additions to the x86 ISA. They are 2-3 times faster than smartly optimized software implementations. However, the old optimization tricks are still relevant if you have to use another polynomial for CRC. In this case, you can use the Slicing-by-8 algorithm.

For population count, PSHUFB (SSSE3) is 3% faster than the specialized POPCNT instruction. You can use POPCNT (the unrolled hardware implementation) if you need to target AMD processors.

Download the code (MSVC, 30 KB)

Peter Kankowski

About the author

Peter is the developer of Aba Search and Replace, a tool for replacing text in multiple files. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP. He can be reached at kankowski@gmail.com.


47 comments

Ten recent comments are shown below.

Wojciech Mula,

Hi, the paper Anatomy of High-Performance 2D Similarity Calculations (http://www.cs.stanford.edu/people/ihaque/#publications) contains some real-life comparisons of popcount procedures. The SSSE3 method is nearly as fast as the hardware popcnt instruction.

Peter Kankowski,

Wojciech, thank you very much.

Mario,

Very cool article, thanks for gathering the algorithms and explaining the experiment.

One thing to note is that the SSE2/SSSE3 popcount techniques work on 16-byte chunks, so if you need to work on single words here and there, they are not that convenient.

Peter Kankowski,

Population count algorithms: http://chessprogramming.wikispaces.com/Population+Count

Marcin Zukowski,

Peter, great website and an interesting comparison. For completeness, it would be nice if you used the 64-bit CRC32 version; it should provide an extra boost.

Btw, in Song's question/solution, doesn't he need to sum every 15 counts, not 31?

Peter Kankowski,

Marcin, thank you for the kind words. Yes, it should be 15 counts, not 31. I will do the benchmarks for the 64-bit CRC32 instruction in the future.

Evgeniy,

Peter, CRC32 has a latency of 3 cycles and a throughput of 1 cycle.

Using the 64-bit instruction and splitting the input buffer into three chunks will make the hardware implementation noticeably faster.

Rich Delaney,

Evgeniy sort of beat me to it in the comment above, however I would like to elaborate.

Have you read this article by Intel that discusses how to work around the 3-cycle latency of the Crc32 instruction?

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

There are actually four versions of the algorithm there. They all split the input buffer and recombine the results, hiding the 3-cycle latency by always keeping three Crc32 instructions in flight. Two versions use look-up tables to recombine the results; the other two use a 64-bit SSE multiplication instruction. One of the SSE multiplication versions operates on a 1024-byte buffer, and it should be the fastest of the four. The article doesn't discuss this, but it can be combined with the version that is not bound to a 1024-byte buffer to eat up the tail of the data. Together, these should be the fastest general-case algorithm.

I would be very curious to see this implementation included with your results. So if it's something that interests you I'd like to call it to your attention. Intel claims this is nearly 3x faster than the hardware implementation you have in the article above.

Sooraa,

I have been using the exact code described in the Intel WP you mentioned for a long time.

I can confirm the throughput figures:

Sandy Bridge @ 4.0 GHz: 15 ms for 440 MB CRC32'd = approx. 29.3 GB/s = 0.136 cycles per byte
Haswell @ 4.1 GHz: 14 ms for 440 MB CRC32'd = even faster

Ashutosh Gupta,

Quote

"JC Meyrignac, 4 years ago

I'm searching for the fastest possible 64-bit popcount.

Here is one routine when no hardware popcnt is available:

int POPCNT64(UINT64 v)
{
    unsigned int v1, v2;
    v1 = (unsigned int) (v & 0xFFFFFFFF);
    v1 -= (v1 >> 1) & 0x55555555;
    v1 = (v1 & 0x33333333) + ((v1 >> 2) & 0x33333333);
    v1 = (v1 + (v1 >> 4)) & 0x0F0F0F0F;
    v2 = (unsigned int) (v >> 32);
    v2 -= (v2 >> 1) & 0x55555555;
    v2 = (v2 & 0x33333333) + ((v2 >> 2) & 0x33333333);
    v2 = (v2 + (v2 >> 4)) & 0x0F0F0F0F;
    return ((v1 * 0x01010101) >> 24) + ((v2 * 0x01010101) >> 24);
}

Is it possible to do better?

(not much information on this particular function on the web)"

Yes, one can do a lot better for a general-purpose x64 architecture:

__forceinline int POPCNT(DWORD64 k) {
	k -= (k >> 1) & 0x5555555555555555ull;
	k = (k & 0x3333333333333333ull) + ((k >> 2) & 0x3333333333333333ull);
	k = (k + (k >> 4)) & 0x0F0F0F0F0F0F0F0Full;
	return (k * 0x0101010101010101ull) >> 56;
}

Your code is mainly optimized for x86 systems. For x64, the above code is significantly faster.

Regards
