Abstract:
Benchmark program for hash tables and comparison of 15 popular hash functions.

Created by Peter Kankowski
Last changed
Contributors: Nils, Ace, Won, Andrew M., and Georgi 'Sanmayce'
Filed under Algorithms

# Hash functions: An empirical comparison

Hash tables are popular data structures for storing key-value pairs. A hash function is used to map the key value (usually a string) to array index. The functions are different from cryptographic hash functions, because they should be much faster and don't need to be resistant to preimage attack. Hashing in large databases is also left out from this article; the benchmark includes medium-size hash tables such as:

• symbol table in a parser,
• IP address table for filtering network traffic,
• the dictionary in a word counting program or a spellchecker.

There are two classes of the functions used in hash tables:

• multiplicative hash functions, which are simple and fast, but have a high number of collisions;
• more complex functions, which have better quality, but take more time to calculate.

Hash table benchmarks usually include theoretical metrics such as the number of collisions or distribution uniformity (see, for example, hash function comparison in the Red Dragon book). Obviously, you will have a better distribution with more complex functions, so they are winners in these benchmarks.

The question is whether using complex functions gives you a faster program. The complex functions require more operations per one key, so they can be slower. Is the price of collisions high enough to justify the additional operations?

## Multiplicative hash functions

Any multiplicative hash function is a special case of the following algorithm:

UINT HashMultiplicative(const CHAR *key, SIZE_T len) {
UINT hash = INITIAL_VALUE;
for(UINT i = 0; i < len; ++i)
hash = M * hash + key[i];
return hash % TABLE_SIZE;
}


(Sometimes XOR operation is used instead of addition, but it does not make much difference.) The hash functions differ only by values of INITIAL_VALUE and multiplier (M). For example, the popular Bernstein's function uses INITIAL_VALUE of 5381 and M of 33; Kernighan and Ritchie's function uses INITIAL_VALUE of 0 and M of 31.

A multiplicative function works by adding together the letters weighted by powers of multiplier. For example, the hash for the word TONE will be:

  INITIAL_VALUE * M^4  +  'T' * M^3  +  'O' * M^2  +  'N' * M  +  'E'


Let's enter several similar strings and watch the output of the functions:

     Bernstein Kernighan
(M=33)     (M=31)
too   b88af17     1c154
top   b88af18     1c155
tor   b88af1a     1c157
tpp   b88af39     1c174
a000  7c9312d6    2cd22f
a001  7c9312d7    2cd230
a002  7c9312d8    2cd231
a003  7c9312d9    2cd232
a004  7c9312da    2cd233
a005  7c9312db    2cd234
a006  7c9312dc    2cd235
a007  7c9312dd    2cd236
a008  7c9312de    2cd237
a009  7c9312df    2cd238
a010  7c9312f7    2cd24e
a     2b606        61
aa    597727       c20
aaa   b885c68     17841


Too and top are different in the last letter only. The letter P is the next one after O, so the values of hash function are different by 1 (1c154 and 1c155, b88af17 and b88af18). Ditto for a000..a009.

Now let's compare top with tpp. Their hashes will be:

  INITIAL_VALUE * M^3 + 'T' * M^2 + 'O' * M + 'P'
INITIAL_VALUE * M^3 + 'T' * M^2 + 'P' * M + 'P'


The hashes will be different by M * ('P' - 'O') = M. Similarly, when the first letters are different by x, their hashes will be different by x * M^2.

When there are less than 33 possible letters, Bernstein's function will pack them into a number (similar to Radix40 packing scheme). For example, hash table of size 333 will provide perfect hashing (without any collisions) for all three-letter English words written in small letters. In practice, the words are longer and hash tables are smaller, so there will be some collisions (situations when different strings have the same hash value).

If the string is too long to fit into the 32-bit number, the first letters will still affect the value of the hash function, because the multiplication is done modulo 2^32 (in a 32-bit register), and the multiplier is chosen to have no common divisors with 2^32 (in other words, it must be odd), so the bits will not be just shifted away.

There are no exact rules for choosing the multiplier, only some heuristics:

• the multiplier should be large enough to accommodate most of the possible letters (e.g., 3 or 5 is too small);
• the multiplier should be fast to calculate with shifts and additions [e.g., 33 * hash can be calculated as (hash << 5) + hash];
• the multiplier should be odd for the reason explained above;
• prime numbers are good multipliers.

## Complex hash functions

These functions do a good job of mixing together the bits of the source word. The change in one input bit changes a half of the bits in the output (see Avalanche_effect), so the result looks completely random:

     Paul Hsieh One At Time
top   78b5a877  4c5dd09a
tor   c09e2021  f2aa9d35
tpp   3058996d  d5e9e480
a000   7552599f  ed3859d8
a001   3cc1d896  fef7fd57
a002   c6ff5c9b  08a610b3
a003   dcab7b0c  1a88b478
a004   780c7202  3621ebaa
a005   7eb63e3a  47db8f1d
a006   6b0a7a17  b901717b
a007   cb5cb1ab  caec1550
a008   5c2a15c0  e58d4a92
a009   33339829  f75aee2d
a010   eb1f336e  bd097a6b
a   115ea782  ca2e9442
aaa   7dfdc310  ae4f22ec


To achieve this behavior, the hash functions perform a lot of shifts, XORs, and additions. But do we need a complex function? What is faster: tolerating the collisions and resolving them with chaining, or avoiding them with a more complex function?

## Test conditions

The benchmark uses separate chaining algorithm for collision resolution. Memory allocation and other "heavy" functions were excluded from the benchmarked code. The RDTSC instruction was used for benchmarking. The test was performed on Pentium-M and Core i5 processors.

The benchmark inserts some keys in the table, then looks them up in the same order as they were inserted. The test data include:

• the list of common words from Wiktionary (500 items);
• the list of Win32 functions from Colorer syntax highlight scheme (1992 items);
• 500 names from a000 to a499 (imitates the names in auto-generated source code);
• the list of common words with a long prefix and postfix;
• all variable names from WordPress 2.3.2 source code in wp-includes folder (1842 names);
• list of all words in Sonnets by W. Shakespeare (imitates a word counting program; 3228 words);
• list of all words in La Peau de chagrin by Balzac (in French, UTF-8 encoding);
• search engine IP addresses (binary).

## Results

### Core i5 processor

WordsWin32NumbersPrefixPostfixVariablesSonnetsUTF-8IPv4Avg
iSCSI CRC65[105]329[415]36[112]84[106]83[92]280[368]408[584]1964[2388]322[838]1.01[1.78]
Meiyan64[102]328[409]45[125]87[106]85[112]274[350]411[588]1972[2377]353[768]1.05[1.87]
Murmur272[103]378[415]48[104]109[106]106[111]315[383]450[566]2183[2399]399[834]1.21[1.74]
XXHfast3278[110]372[420]57[102]88[103]88[106]315[347]473[491]2323[2494]463[838]1.23[1.71]
SBox70[91]389[431]46[116]124[108]123[91]304[347]430[526]2182[2442]377[836]1.23[1.78]
Larson72[99]401[416]34[16]143[99]141[105]312[366]451[583]2230[2447]349[755]1.25[1.10]
XXHstrong3279[109]385[429]58[102]93[102]92[112]321[355]474[491]2332[2496]464[838]1.25[1.72]
Sedgewick73[107]417[414]36[48]143[103]143[103]319[348]446[570]2246[2437]349[782]1.26[1.33]
Novak unrolled76[113]404[399]43[90]127[118]125[113]322[342]459[581]2284[2430]379[969]1.26[1.68]
CRC-3270[101]429[426]40[64]146[107]143[94]320[338]443[563]2231[2400]357[725]1.28[1.41]
Murmur378[101]391[380]54[104]108[103]107[105]331[334]492[555]2360[2376]433[783]1.28[1.69]
x6559974[111]407[382]45[203]144[107]144[122]316[379]449[560]2221[2373]349[846]1.29[2.45]
FNV-1a74[124]408[428]47[108]144[94]144[105]309[374]440[555]2193[2446]376[807]1.30[1.77]
Murmur2A79[114]410[433]53[102]117[112]114[109]337[365]494[544]2377[2369]429[772]1.31[1.73]
Fletcher71[131]352[406]80[460]104[127]100[108]312[507]481[1052]2477[4893]388[1359]1.31[4.62]
K&R73[106]429[437]47[288]149[94]149[106]324[360]450[561]2266[2365]343[831]1.32[3.00]
Paul Hsieh80[114]410[420]54[118]123[101]121[100]336[341]496[600]2351[2380]433[847]1.33[1.83]
Bernstein75[114]428[412]49[288]150[100]150[102]324[353]460[572]2312[2380]351[703]1.34[2.99]
x17 unrolled78[109]446[415]43[24]156[113]153[102]344[368]472[589]2361[2392]373[829]1.37[1.19]
lookup383[101]459[412]55[97]140[101]137[95]359[361]526[550]2480[2392]427[834]1.42[1.65]
MaPrime2c79[103]459[426]50[106]155[91]155[106]349[349]486[550]2493[2406]406[865]1.42[1.73]
Ramakrishna80[108]513[409]44[91]189[125]186[103]370[360]483[528]2565[2383]380[840]1.51[1.66]
One At Time85[105]562[421]58[110]221[97]220[103]392[364]511[545]2659[2346]459[795]1.72[1.75]
Arash Partow83[101]560[435]71[420]215[98]212[85]392[355]507[570]2638[2372]407[779]1.72[3.88]
Weinberger87[104]590[422]37[100]254[111]273[117]398[364]541[712]2734[2547]419[744]1.78[1.75]
Hanson73[118]417[649]45[112]123[118]1207[499]318[435]448[592]2324[2890]370[833]2.70[2.46]

### Pentium-M processor

WordsWin32NumbersPrefixPostfixVariablesSonnetsUTF-8IPv4Avg
Meiyan80[102]426[409]56[125]123[106]121[112]354[350]525[588]2443[2377]445[768]1.02[1.87]
Novak unrolled90[113]517[399]56[90]169[118]164[113]398[342]575[581]2716[2430]482[969]1.18[1.68]
Fletcher84[131]444[406]102[460]140[127]133[108]374[507]592[1052]2891[4893]513[1359]1.21[4.62]
SBox88[91]552[431]57[116]181[108]178[91]414[347]560[526]2814[2442]472[836]1.22[1.78]
Murmur297[103]532[415]65[104]165[106]162[111]434[383]622[566]2948[2399]537[834]1.25[1.74]
CRC-3290[101]565[426]55[64]198[107]192[94]427[338]590[563]2842[2400]469[725]1.26[1.41]
x17 unrolled93[109]593[415]52[24]214[113]208[102]434[368]593[589]2867[2392]486[829]1.30[1.19]
lookup394[101]565[412]70[97]189[101]182[95]432[361]631[550]2943[2392]572[834]1.32[1.65]
K&R93[106]619[437]58[288]221[94]218[106]442[360]587[561]2961[2365]447[831]1.33[3.00]
Larson95[99]631[416]49[16]231[99]228[105]455[366]599[583]3027[2447]469[755]1.35[1.10]
XXHfast32108[110]546[420]86[102]139[103]136[106]459[347]681[491]3259[2494]717[838]1.35[1.71]
Murmur3108[101]561[380]74[104]167[103]165[105]468[334]700[555]3259[2376]604[783]1.36[1.69]
Bernstein97[114]622[412]61[288]225[100]222[102]448[353]609[572]3053[2380]469[703]1.37[2.99]
XXHstrong32108[109]558[429]86[102]150[102]147[112]460[355]682[491]3262[2496]714[838]1.38[1.72]
x6559999[111]628[382]61[203]234[107]232[122]459[379]630[560]3097[2373]471[846]1.40[2.45]
Paul Hsieh106[114]576[420]82[118]183[101]178[100]456[341]678[600]3154[2380]670[847]1.41[1.83]
Sedgewick101[107]667[414]52[48]245[103]242[103]478[348]630[570]3204[2437]475[782]1.42[1.33]
Murmur2A113[114]598[433]78[102]183[112]178[109]488[365]719[544]3380[2369]651[772]1.44[1.73]
FNV-1a102[124]660[428]62[108]239[94]237[105]473[374]627[555]3140[2446]516[807]1.44[1.77]
MaPrime2c108[103]705[426]65[106]255[91]254[106]508[349]674[550]3413[2406]542[865]1.54[1.73]
Ramakrishna108[108]728[409]61[91]278[125]272[103]511[360]660[528]3378[2383]517[840]1.56[1.66]
Arash Partow106[101]739[435]93[420]280[98]275[85]514[355]671[570]3332[2372]543[779]1.65[3.88]
One At Time118[105]830[421]81[110]321[97]319[103]578[364]741[545]3809[2346]657[795]1.82[1.75]
Weinberger119[104]956[422]54[100]375[111]379[117]614[364]745[712]3973[2547]560[744]1.89[1.75]
Hanson86[118]531[649]55[112]168[118]1722[499]393[435]549[592]2742[2890]463[833]2.60[2.46]

Each cell includes the execution time, then the number of collisions in square brackets. Execution time is expressed in thousands of clock cycles (a lower number is better). Avg column contains the average normalized execution time (and the number of collisions).

The function by Kernighan and Ritchie is from their famous book "The C programming Language", 3rd edition; Weinberger's hash and the hash with multiplier 65599 are from the Red Dragon book. The latter function is used in gawk, sdbm, and other Linux programs. x17 is the function by Peter Kankowski (multiplier = 17; 32 is subtracted from each letter code).

As you can see from the table, the function with the lowest number of collisions is not always the fastest one.

Results on a large data set (list of all words in English Wikipedia, 12.5 million words, from the benchmark by Georgi 'Sanmayce'):

### Core i5 processor

WikipediaAvg
iSCSI CRC5725944[2077725]1.00[1.00]
Meiyan5829105[2111271]1.02[1.02]
Murmur26313466[2081476]1.10[1.00]
Larson6403975[2080111]1.12[1.00]
Murmur36492620[2082084]1.13[1.00]
x655996479417[2102893]1.13[1.01]
FNV-1a6599423[2081195]1.15[1.00]
SBox6964673[2084018]1.22[1.00]
Hanson7007689[2129832]1.22[1.03]
CRC-327016147[2075088]1.23[1.00]
Sedgewick7060691[2080640]1.23[1.00]
XXHfast327078804[2084164]1.24[1.00]
K&R7109841[2083145]1.24[1.00]
XXHstrong327168788[2084514]1.25[1.00]
Bernstein7247096[2074237]1.27[1.00]
lookup37342986[2084889]1.28[1.01]
Murmur2A7376650[2081370]1.29[1.00]
Paul Hsieh7387317[2180206]1.29[1.05]
x17 unrolled7410443[2410605]1.29[1.16]
Ramakrishna8172670[2093253]1.43[1.01]
One At Time8338799[2087861]1.46[1.01]
MaPrime2c8428492[2084467]1.47[1.00]
Arash Partow8503299[2084572]1.49[1.00]
Weinberger9416340[3541181]1.64[1.71]
Novak unrolled21289919[6318611]3.72[3.05]
Fletcher22235133[9063797]3.88[4.37]

### Pentium-M processor

WikipediaAvg
x17 unrolled11321744[2410605]1.00[1.16]
K&R11666050[2083145]1.03[1.00]
Bernstein11833902[2074237]1.05[1.00]
Larson11888751[2080111]1.05[1.00]
Sedgewick12111839[2080640]1.07[1.00]
x6559912144777[2102893]1.07[1.01]
Arash Partow12235396[2084572]1.08[1.00]
Ramakrishna12185834[2093253]1.08[1.01]
Meiyan12269691[2111271]1.08[1.02]
CRC-3212604152[2075088]1.11[1.00]
Murmur212713455[2081476]1.12[1.00]
SBox12716574[2084018]1.12[1.00]
Hanson12627597[2129832]1.12[1.03]
lookup312791917[2084889]1.13[1.01]
FNV-1a12868991[2081195]1.14[1.00]
Murmur312916960[2082084]1.14[1.00]
XXHfast3212936106[2084164]1.14[1.00]
XXHstrong3212950650[2084514]1.14[1.00]
Murmur2A13068746[2081370]1.15[1.00]
Paul Hsieh12992315[2180206]1.15[1.05]
MaPrime2c13348580[2084467]1.18[1.00]
One At Time13662010[2087861]1.21[1.01]
Weinberger14592843[3541181]1.29[1.71]
Fletcher37410790[9063797]3.30[4.37]
Novak unrolled37769882[6318611]3.34[3.05]

Some functions were excluded from the benchmark because of very bad performance:

• Adler-32 (slow filling, not suitable as a hash function);
• TwoChars (bad for machine-generated names and variable names that are similar to each other, disastrous for large data sets such as Wikipedia).

The number of collisions depending on the hash table size (for the same data set, thanks to Ace for the idea):

Red Dragon Book proposes the following formula for evaluating hash function quality:

where bj is the number of items in j-th slot, m is the number of slots, and n is the total number of items. The sum of bj(bj + 1) / 2 estimates the number of slots your program should visit to find the required value. The denominator (n / 2m)(n + 2m − 1) is the number of visited slots for an ideal function that puts each item into a random slot. So, if the function is ideal, the formula should give 1. In reality, a good function is somewhere between 0.95 and 1.05. If it's more, there is a high number of collisions (slow!). If it's less, the function gives less collisions than the randomly distributing function, which is not bad.

Here are the results for some of our functions:

## Conclusion

Complex functions by Paul Hsieh and Bob Jenkins are tuned for long keys, such as the ones in postfix and prefix tests. Note that they do not provide the best number of collisions for these tests, but do have the best time, which means that the functions are faster than the others because of loop unrolling. At the same time, they are suboptimal for short keys (words and sonnets tests).

For a word counting program, a compiler, or another application that typically handles short keys, it's often advantageous to use a simple multiplicative function such as x17 or Larson's hash. However, these functions perform badly on long keys.

Novak showed bad results on the large data set. Jesteress has a high number of collisions in numbers test.

Murmur2, Meiyan, SBox, and CRC32 provide good performance for all kinds of keys. They can be recommended as general-purpose hashing functions on x86.

Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.

Download the source code (152 KB, MSVC++)

## Variations

### XORing high and low part

For table size less than 2^16, we can improve the quality of hash function by XORing high and low words, so that more letters will be taken into account:

   return hash ^ (hash >> 16);


### Subtracting a constant

x17 hash function subtracts a space from each letter to cut off the control characters in the range 0x00..0x1F. If the hash keys are long and contain only Latin letters and numbers, the letters will be less frequently shifted out, and the overall number of collisions will be lower. You can even subtract 'A' when you know that the keys will be only English words.

### Using larger multipliers for a compiler

Paul Hsieh noted that large multipliers may provide better results for the hash table in a compiler, because a typical source code contains a lot of one-letter variable names (i, j, s, etc.), and they will collide if the multiplier is less than the number of letters in the alphabet.

The test confirms this assumption: the function by Kernighan & Ritchie (M = 33) has lower number of collisions than x17 (M = 17), but the latter is still faster (see Variables column in the table above).

### Setting hash table size to a prime number

A test showed that the number of collisions will usually be lower if you use a prime, but the calculations modulo prime take much more time than the calculations for a power of 2, so this method is impractical. Even replacing division with multiplication by reciprocal values do not help here:

WordsWin32NumbersPrefixPostfixVariablesShakespeare
Bernstein % 2K145[261]880[889]426[8030]326[214]316[226]649[697]874[1131]
Bernstein % prime186[221]1049[995]445[5621]364[194]357[217]805[800]1123[1051]
Bernstein optimized mod160[221]960[995]416[5621]341[194]334[217]722[800]969[1051]
x17 % 2K137[193]847[1002]81[340]314[244]300[228]641[863]832[1012]
x17 % prime173[256]1010[1026]104[324]356[246]339[216]760[760]1046[1064]
x17 optimized mod155[256]915[1026]96[324]330[246]315[216]691[760]930[1064]

### Implementing open addressing vs. separate chaining

With open addressing, most hash functions show awkward clustering behavior in "Numbers" test:

Bernst.K&Rx17 unrollx65599FNVUnivWeinb.HsiehOne-atLookup3PartowCRC
OA4268184207889127311010392104279
[8030][20810][340][3158][207][480][4360][342][267][205][20860][96]
32-bit17969741148680125105999234782
[8030][20810][340][3158][207][480][4360][342][267][205][20860][96]
chain92687382888473107999514984
[500][500][24][258][124][48][100][138][131][108][1530][64]

You can avoid the worst case by using chaining for collision resolution. However, chaining requires more memory for the next item pointers, so the performance improvement does not come for free. A custom memory allocator should be usually written, because calling malloc() for a large number of small structures is suboptimal.

Some implementations (e.g., hash table in Python interpreter) store a full 32-bit hash with the item to speed up the string comparison, but this is less effective than chaining.

## About the author

Peter lives in Siberia, the land of sleeping sun, beautiful mountains, and infinitely deep snow. He likes to program in C with a bit of C++, also in x86 assembly language, Python, and PHP (on Windows platform). He can be reached at kankowski@narod.ru.

Created by Peter Kankowski
Last changed
Contributors: Nils, Ace, Won, Andrew M., and Georgi 'Sanmayce'

Ten recent comments are shown below. Show all comments

Georgi 'Sanmayce',

Hi guys,

I am happy to share my latest best.

A remainderless variant for 16[+] bytes keys appeared while I was playing with Yorikke and wanting to try an old idea of mine - to reduce the branching.

When the GOLDEN Yorikke was Interleaved&Interlaced a DIAMANT appeared - it's time for a new generation slasher: FNV1A_Yoshimura.

Up to now the TOP benchmark results I was able to gather:

On AMD (Phenom II X6 1600T, 4000MHz) FNV1A_YoshimitsuTRIAD reigns with 11.360MB per clock or 11360/1024 = 11.093GB/s.

On Intel (i7-3930K, 4500MHz) FNV1A_YoshimitsuTRIAD reigns with 14.928MB per clock or 14928/1024 = 14.578GB/s.



It is worth the attempt to explore the Jesteress-Yorikke (i.e. 1 hash line 4+4 vs 2 hash lines 4+4) 8-16 GAP in order to lessen their collisions further more.

Simply put, 3 hash lines 4 bytes each, 12 bytes per loop.

The 'non power of 2' workaround I see as one MONOLITH function with no remainder mixing at all.

The idea #1 is to exterminate all nasty IFs outwith the main loop, I believe such branchless etude will outperform Jesteress.

The idea #2 is to STRESS memory by fetching not-so-adjacent areas.

For example:

Key: hash_with_overlapping_aye_aye

Key left-to-right quadruplets and remainder: 'hash', '_wit', 'h_ov', 'erla', 'ppin', 'g_ay', 'e_ay', 'e'
Key right-to-left quadruplets and remainder: 'h', 'ash_', 'with', '_ove', 'rlap', 'ping', '_aye', '_aye'


Key_Length: 29

Loop_Counter: 3 //if ( Key_Length%(3*4) ) Loop_Counter = Key_Length/(3*4)+1; else Loop_Counter = Key_Length/(3*4);



Loop #1 of 3:

Hash line 1: hash

Hash line 2: h_ov

Hash line 3: ping

Loop #2 of 3:

Hash line 1: _wit

Hash line 2: erla

Hash line 3: _aye

Loop #3 of 3:

Hash line 1: h_ov

Hash line 2: ppin

Hash line 3: _aye

I don't know the internals, whether lines are 32/64/128 bytes long is a secondary concern.

Well, the key is too short, in reality the above key may span only 1|2 cache lines, if the key is longer than 4 cache lines (assuming 32) e.g. 128+2 bytes then it may span 5|6 lines.

My dummy/clueless premise is that is possible (in future systems) to access effectively RAM in such manner.

Does someone know whether such type of accessing has any practical value in nowadays CPUs?

Of course, it is worth trying to "interleave" in that way all the short string hashers, yes?

Anyway the 3 lines are in stack, for the time being let's see how 'INTERLEAVED' Yorikke, which I called FNV1A_Yoshimura, behaves.

// [North Star One-Sword School]
// - My name is Kanichiro Yoshimura.
//   I'm a new man. Just so you'll know who I am...
//   Saito-sensei.
// - What land are you from?
// - 'Land'?
// - Yes.
// - I was born in Morioka, in Nanbu, Oshu.
//   It's a beautiful place.
//   Away to the south is Mt Hayachine...
//   with Mt Nansho and Mt Azumane to the west.
//   In the north are Mt Iwate and Mt Himekami.
//   Out of the high mountains flows the Nakatsu River...
//   through the castle town into the Kitakami below Sakuranobaba.
//   Ah, it's pretty as a picture!
//   There's nowhere like it in all Japan!
// /Paragon Kiichi Nakai in the paragon piece-of-art 'The Wolves of Mibu' aka 'WHEN THE LAST SWORD IS DRAWN'/
// As I said on one Japanese forum, Kiichi Nakai deserves an award worth his weight in gold, nah-nah, in DIAMONDS!
uint32_t FNV1A_Hash_Yoshimura(const char *str, uint32_t wrdlen)
{
const uint32_t PRIME = 709607;
uint32_t hash32 = 2166136261;
uint32_t hash32B = 2166136261;
const char *p = str;
uint32_t Loop_Counter;
uint32_t Second_Line_Offset;

if (wrdlen >= 2*2*sizeof(uint32_t)) {
Second_Line_Offset = wrdlen-((wrdlen>>4)+1)*(2*4); // ((wrdlen>>1)>>3)
Loop_Counter = (wrdlen>>4);
//if (wrdlen%16) Loop_Counter++;
Loop_Counter++;
for(; Loop_Counter; Loop_Counter--, p += 2*sizeof(uint32_t)) {
// revision 1:
//hash32 = (hash32 ^ (_rotl(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+4))) * PRIME;
//hash32B = (hash32B ^ (_rotl(*(uint32_t *)(p+0+Second_Line_Offset),5) ^ *(uint32_t *)(p+4+Second_Line_Offset))) * PRIME;
// revision 2:
hash32 = (hash32 ^ (_rotl(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;
hash32B = (hash32B ^ (_rotl(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;
}
} else {
// Cases: 0,1,2,3,4,5,6,7,...,15
if (wrdlen & 2*sizeof(uint32_t)) {
hash32 = (hash32 ^ *(uint32_t*)(p+0)) * PRIME;
hash32B = (hash32B ^ *(uint32_t*)(p+4)) * PRIME;
p += 4*sizeof(uint16_t);
}
// Cases: 0,1,2,3,4,5,6,7
if (wrdlen & sizeof(uint32_t)) {
hash32 = (hash32 ^ *(uint16_t*)(p+0)) * PRIME;
hash32B = (hash32B ^ *(uint16_t*)(p+2)) * PRIME;
p += 2*sizeof(uint16_t);
}
if (wrdlen & sizeof(uint16_t)) {
hash32 = (hash32 ^ *(uint16_t*)p) * PRIME;
p += sizeof(uint16_t);
}
if (wrdlen & 1)
hash32 = (hash32 ^ *p) * PRIME;
}
hash32 = (hash32 ^ _rotl(hash32B,5) ) * PRIME;
return hash32 ^ (hash32 >> 16);
}



To reproduce the quick-test below here comes: http://www.sanmayce.com/Fastest_Hash/DOUBLOON_hash_micro-package_r2.zip

The results on my 'Bonboniera' T7500, throwing mostly 16+ long keys at the "awful greedy country samurai":

E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r2>RUNME.BAT
E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r2>type Results.txt



Intel 12.1:

8192 elements in the table (13 bits)

           Jesteress: 493 [  576]
Meiyan: 515 [  583]
Yorikke: 458 [  579]
Yoshimura: 379 [  593] !!! SIGNIFICANTLY fastEST !!!
Yoshimitsu: 497 [  609]
YoshimitsuTRIAD: 489 [  615]
FNV-1a: 969 [  604]
Larson: 947 [  581]
CRC-32: 894 [  613]
Murmur2: 656 [  600]
Murmur3: 711 [  583]
XXHfast32: 504 [  596]
XXHstrong32: 528 [  571]


2048 elements in the table (11 bits)

           Jesteress:  268 [  205]
Meiyan:  268 [  205]
Yorikke:  224 [  207]


Yoshimura: 235 [ 187] ??? the slowest of all the four Yo*, something to ponder about ???

          Yoshimitsu:  225 [  225]
YoshimitsuTRIAD:  221 [  219]
FNV-1a: 1125 [  225]
Larson: 1131 [  212]
CRC-32:  919 [  230]
Murmur2:  439 [  222]
Murmur3:  497 [  223]
XXHfast32:  250 [  223]
XXHstrong32:  309 [  192]


65536 elements in the table (16 bits)

           Jesteress: 12249 [ 6883]
Meiyan: 12369 [ 6897]
Yorikke: 11000 [ 6872]
Yoshimura:  9876 [ 6908] !!! fastEST, yet with high collisions !!!
Yoshimitsu: 11489 [ 6937]
YoshimitsuTRIAD: 11094 [ 6843]
FNV-1a: 39491 [ 6840]
Larson: 39714 [ 6889]
CRC-32: 34264 [ 6891]
Murmur2: 17678 [ 6786]
Murmur3: 19626 [ 6850]
XXHfast32: 10383 [ 6859]
XXHstrong32: 12708 [ 6887]



Microsoft 16:

8192 elements in the table (13 bits)

           Jesteress:  756 [  576]
Meiyan:  781 [  583]
Yorikke:  776 [  579]
Yoshimura:  740 [  593]
Yoshimitsu:  781 [  609]
YoshimitsuTRIAD:  803 [  615]
FNV-1a: 1306 [  604]
Larson: 1304 [  581]
CRC-32: 1204 [  613]
Murmur2:  983 [  600]
Murmur3: 1031 [  583]
XXHfast32:  859 [  596]
XXHstrong32:  883 [  571]


2048 elements in the table (11 bits)

           Jesteress:  463 [  205]
Meiyan:  464 [  205]
Yorikke:  422 [  207]
Yoshimura:  442 [  187]
Yoshimitsu:  431 [  225]
YoshimitsuTRIAD:  423 [  219]
FNV-1a: 1311 [  225]
Larson: 1319 [  212]
CRC-32: 1148 [  230]
Murmur2:  648 [  222]
Murmur3:  637 [  223]
XXHfast32:  451 [  223]
XXHstrong32:  496 [  192]


65536 elements in the table (16 bits)

           Jesteress: 20162 [ 6883]
Meiyan: 20124 [ 6897]
Yorikke: 19101 [ 6872]
Yoshimura: 17801 [ 6908]
Yoshimitsu: 19616 [ 6937]
YoshimitsuTRIAD: 19370 [ 6843]
FNV-1a: 47142 [ 6840]
Larson: 48009 [ 6889]
CRC-32: 42964 [ 6891]
Murmur2: 25741 [ 6786]
Murmur3: 25654 [ 6850]
XXHfast32: 18179 [ 6859]
XXHstrong32: 20557 [ 6887]



Mr.Norton you are most welcome, maybe you have already spotted that I don't target 64bit stamps at all, yet, if you are interested I can write r.3 of my 'Tesla' function using these (Interleaving & Interlacing) nifty techniques - I believe it will outperform itself being the fastest 64bit hash in m^2 testbench.

Georgi 'Sanmayce',

I thank Przemyslaw Skibinski and Maciej Adamczyk (m^2) for their 64bit testbench which I included along with yours (Peter) in the benchmark:

http://www.sanmayce.com/Fastest_Hash/DOUBLOON_hash_micro-package_r3.zip

Results below are for the Intel 12.1 32bit executable on my laptop:

// hash_I 16KB_as_one_line.TXT:


4 elements in the table (2 bits)

           Jesteress:  17 [    0]
Meiyan:  17 [    0]
Yorikke:  10 [    0]
Yoshimura:  10 [    0]
Yoshimitsu:   9 [    0]
YoshimitsuTRIAD:  10 [    0]
FNV-1a: 131 [    0]
Larson: 131 [    0]
CRC-32: 102 [    0]
Murmur2:  35 [    0]
Murmur3:  42 [    0]
XXHfast32:  11 [    0]
XXHstrong32:  21 [    0]
...
// hash_I 100MB_as_one_line.TXT:


4 elements in the table (2 bits)

           Jesteress: 128304 [    0]
Meiyan: 128253 [    0]
Yorikke: 108994 [    0]
Yoshimura:  94794 [    0]
Yoshimitsu: 101738 [    0]
YoshimitsuTRIAD: 104999 [    0]
FNV-1a: 850704 [    0]
Larson: 853118 [    0]
CRC-32: 663362 [    0]
Murmur2: 239559 [    0]
Murmur3: 287138 [    0]
XXHfast32: 104313 [    0]
XXHstrong32: 153782 [    0]
// hash_I 200MB_as_one_line.TXT:


4 elements in the table (2 bits)

           Jesteress:  257028 [    0]
Meiyan:  256955 [    0]
Yorikke:  218383 [    0]
Yoshimura:  190907 [    0]
Yoshimitsu:  203801 [    0]
YoshimitsuTRIAD:  210256 [    0]
FNV-1a: 1702024 [    0]
Larson: 1709548 [    0]
CRC-32: 1327211 [    0]
Murmur2:  479303 [    0]
Murmur3:  574363 [    0]
XXHfast32:  208887 [    0]
XXHstrong32:  307942 [    0]



I wrote revision 3 of FNV1A_Tesla as 64bit counterpart of FNV1A_Yoshimura and included them into the 64bit linear speed test by Przemyslaw and Maciej, the results are (I threw the 200MB at the hashers):

As console screenshots:

http://www.sanmayce.com/Fastest_Hash/64bit_page1_.png

http://www.sanmayce.com/Fastest_Hash/64bit_page2_.png

As console text dumps:

E:\DOUBLOON_hash_micro-package_r3>RUNME_64bit.BAT

E:\DOUBLOON_hash_micro-package_r3>benchmark_Intel_12.1_O2.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 108 ms, 209715202 bytes = 1851 MB/s
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
CityHash128                             1.0.3
209715218 (x 1.000)      3333 MB/s 3333 MB/s      273e15 273e15
CityHash64                              1.0.3
209715210 (x 1.000)      3333 MB/s 3389 MB/s      273e15 277e15
SpookyHash                              2012-03-30
209715218 (x 1.000)      4081 MB/s 4081 MB/s      334e15 334e15
fnv1a-jesteress                         v2
209715206 (x 1.000)      3333 MB/s 3333 MB/s      273e15 273e15
fnv1a-yoshimura                         v2
209715206 (x 1.000)      4166 MB/s 4166 MB/s      341e15 341e15
fnv1a-tesla3                            v2
209715210 (x 1.000)      4347 MB/s 4347 MB/s      356e15 356e15
xxhash-fast                             r3
209715206 (x 1.000)      4000 MB/s 4000 MB/s      327e15 327e15
xxhash-strong                           r3
209715206 (x 1.000)      2816 MB/s 2816 MB/s      230e15 230e15
xxhash256                               r3
209715234 (x 1.000)      4166 MB/s 4166 MB/s      341e15 341e15
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
done... (77x1 iteration(s)).

E:\DOUBLOON_hash_micro-package_r3>benchmark_Intel_12.1_O3.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 109 ms, 209715202 bytes = 1834 MB/s
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
CityHash128                             1.0.3
209715218 (x 1.000)      3278 MB/s 3333 MB/s      268e15 273e15
CityHash64                              1.0.3
209715210 (x 1.000)      3333 MB/s 3278 MB/s      273e15 268e15
SpookyHash                              2012-03-30
209715218 (x 1.000)      3278 MB/s 3278 MB/s      268e15 268e15
fnv1a-jesteress                         v2
209715206 (x 1.000)      3333 MB/s 3333 MB/s      273e15 273e15
fnv1a-yoshimura                         v2
209715206 (x 1.000)      3921 MB/s 3921 MB/s      321e15 321e15
fnv1a-tesla3                            v2
209715210 (x 1.000)      4166 MB/s 4166 MB/s      341e15 341e15
xxhash-fast                             r3
209715206 (x 1.000)      3636 MB/s 3636 MB/s      297e15 297e15
xxhash-strong                           r3
209715206 (x 1.000)      2777 MB/s 2777 MB/s      227e15 227e15
xxhash256                               r3
209715234 (x 1.000)      3773 MB/s 3773 MB/s      309e15 309e15
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
done... (77x1 iteration(s)).

E:\DOUBLOON_hash_micro-package_r3>benchmark_Intel_12.1_fast.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 110 ms, 209715202 bytes = 1818 MB/s
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
CityHash128                             1.0.3
209715218 (x 1.000)      2380 MB/s 2380 MB/s      195e15 195e15
CityHash64                              1.0.3
209715210 (x 1.000)      2105 MB/s 2105 MB/s      172e15 172e15
SpookyHash                              2012-03-30
209715218 (x 1.000)      3508 MB/s 3508 MB/s      287e15 287e15
fnv1a-jesteress                         v2
209715206 (x 1.000)      3389 MB/s 3389 MB/s      277e15 277e15
fnv1a-yoshimura                         v2
209715206 (x 1.000)      4000 MB/s 4000 MB/s      327e15 327e15
fnv1a-tesla3                            v2
209715210 (x 1.000)      4255 MB/s 4255 MB/s      348e15 348e15
xxhash-fast                             r3
209715206 (x 1.000)      3773 MB/s 3773 MB/s      309e15 309e15
xxhash-strong                           r3
209715206 (x 1.000)      2777 MB/s 2777 MB/s      227e15 227e15
xxhash256                               r3
209715234 (x 1.000)      3921 MB/s 3921 MB/s      321e15 321e15
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
done... (77x1 iteration(s)).

E:\DOUBLOON_hash_micro-package_r3>benchmark_Microsoft_VS2010_Ox.exe CityHash128 CityHash64 SpookyHash fnv1a-jesteress fnv1a-yoshimura fnv1a-tesla3 xxhash-fast xxhash-strong xxhash256 -i77 200MB_as_one_line.TXT
memcpy: 111 ms, 209715202 bytes = 1801 MB/s
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
CityHash128                             1.0.3
209715218 (x 1.000)      4444 MB/s 4444 MB/s      364e15 364e15
CityHash64                              1.0.3
209715210 (x 1.000)      4255 MB/s 4255 MB/s      348e15 348e15
SpookyHash                              2012-03-30
209715218 (x 1.000)      4081 MB/s 4081 MB/s      334e15 334e15
fnv1a-jesteress                         v2
209715206 (x 1.000)      3333 MB/s 3278 MB/s      273e15 268e15
fnv1a-yoshimura                         v2
209715206 (x 1.000)      4166 MB/s 4166 MB/s      341e15 341e15
fnv1a-tesla3                            v2
209715210 (x 1.000)      4347 MB/s 4347 MB/s      356e15 356e15
xxhash-fast                             r3
209715206 (x 1.000)      4255 MB/s 4255 MB/s      348e15 348e15
xxhash-strong                           r3
209715206 (x 1.000)      2857 MB/s 2857 MB/s      234e15 234e15
xxhash256                               r3
209715234 (x 1.000)      4255 MB/s 4255 MB/s      348e15 348e15
Codec                                   version      args
C.Size      (C.Ratio)        C.Speed   D.Speed      C.Eff. D.Eff.
done... (77x1 iteration(s)).

E:\DOUBLOON_hash_micro-package_r3>



FNV1A_Yoshimura is simply DIAMANTINE.

Bulat Ziganshin,

VMAC-64, cryptographically strong keyed hash, runs at x64 with 4 bytes/cycle speed. i.e. 16 GB/s

Peter Kankowski,

Thank you. VMAC-64 is a cryptographic hash function, not a function for hash tables. Most likely, it will be slow on the short strings that are used in my benchmark. You are welcomed to benchmark it and post the results here.

Georgi 'Sanmayce',

For one, Bulat's point is from data deduplication context, am I right?

There VMAC(VHASH) IS a function for hash tables, where keys can be chunks several KB long.

Though not "natively" targeted for HT it could be used as such, similarly to CRC32 not a "native" HT function but a checksum.

Looking at VMAC(VHASH)'s main loop (the version below does 64-bytes of message at a time):

    for (i = 0; i < nw; i+= 8) {
MUL64(th,tl,get64PE((mp)+i  )+(kp)[i  ],get64PE((mp)+i+1)+(kp)[i+1]);
MUL64(th,tl,get64PE((mp)+i+2)+(kp)[i+2],get64PE((mp)+i+3)+(kp)[i+3]);
MUL64(th,tl,get64PE((mp)+i+4)+(kp)[i+4],get64PE((mp)+i+5)+(kp)[i+5]);
MUL64(th,tl,get64PE((mp)+i+6)+(kp)[i+6],get64PE((mp)+i+7)+(kp)[i+7]);
}



It looks elegant, straight-forward and with potential to be bettered, yet, these 4 64x64 MULs are STILL scary.

As for those 4 bytes/cycle (on i7?!) they are kick-ass but in that respect 'xxhash256' kicks its ass quickly:

On Core 2 2.33GHz in L2 cache xxhash256 hashes at 11.01GB/s or 5+ B/c, but this is only a warm-up, on Phenom II X4 955, 3.2GHz (4 threads) in L1 cache xxhash256 hashes at 73.98GB/s or 6+ B/c = (73.98*1024*1024*1024)/(4*3.2*1000*1000*1000), OUCH!



Well, xxhash256 is a monster, but there are supermonsters as well, in the name of reaching higher Bytes-Per-Cycle let us see what a Core i3 laptop can achieve in L3/L2/L1 using XMM registers:

Core i3 2310M 2.1GHz laptop:

FNV1A_penumbra, (2MB block): 15593MB/s or (15593*1024*1024)/(2.1*1000*1000*1000)=7.7B/c
FNV1A_penumbra, (128KB block): 19692MB/s or (19692*1024*1024)/(2.1*1000*1000*1000)=9.8B/c
FNV1A_penumbra, (16KB block): 19877MB/s or (19877*1024*1024)/(2.1*1000*1000*1000)=9.9B/c



For reference my Core 2 T7500 2.2GHz laptop gave:

FNV1A_penumbra, (16KB block): 11262MB/s or (11262*1024*1024)/(2.2*1000*1000*1000)=5.3B/c



Seeing that 'Everest' gives 67011MB/s for L1 cache Read on this i3 and comparing it to 19877MB/s results in only 237% non-utilization.

Brutal improvement in XMM department over the preprevious generation (i3 being of 2nd), 8GB/s - huge difference indeed.

FNV1A_penumbra simply combines the short (192- bytes) and "long" 192[+] bytes long keys under one roof, first hashed by FNV1a-YoshimitsuTRIADii while the second by an unrolled XMM FNV1a whopper.

And one more thing, dispersion, a FNV1A-YoshimitsuTRIADii vs VHASH showdown would be interesting.

Currently I am hashing (using 27bit HT) 1+ trillion a la Knight-Tour 128 byte long keys, 220h later the result is:

FNV1A_YoshimitsuTRIADii: KT_DumpCounter = 0,629,883,797,505; 000,000,001 x MAXcollisionsAtSomeSlots = 005,065; HASHfreeSLOTS = 0,000,000,000
CRC32 0x8F6E37A0, iSCSI: KT_DumpCounter = 0,629,883,797,505; 000,000,002 x MAXcollisionsAtSomeSlots = 005,072; HASHfreeSLOTS = 0,000,000,000

Where keys:slots ratio is 629,883,797,505/134,217,727 = 4693 or CRC32/FNV1A_YoshimitsuTRIADii show only 8% deviation from "ideal" distribution.



In my view each and every 'text message' HT function should be tortured with this 128 (neither long nor short) key length.

m^2,

I run a quick slash_hash test with SMHasher:

https://gist.github.com/anonymous/5926294

Not great.

Ace,

Safe against DoS attacks, by Jean-Philippe Aumasson and Daniel J. Bernstein:

https://131002.net/siphash/

Another mentioned:

Ace,

And the set of tests maintained by Murmur authors: http://code.google.com/p/smhasher/

Peter Kankowski,

Thank you, SipHash looks interesting. I will include it in the benchmarks.

ace,

http://blog.booking.com/hardening-perls-hash-function.html

"Analysis done by the Perl5 Security Team suggests that One-At-A-Time-Hash is intrinsically more secure than MurmurHash. However to my knowledge, there is no peer-reviewed cryptanalysis to prove it.

There seems to be very little research into fast, robust, hash algorithms which are suitable for dynamic languages. Siphash is a notable exception and a step forward, but within the Perl community there seems to be consensus that it is currently too slow, at least in the recommended Siphash-2-4 incarnation. It is also problematic that its current implementation only supports 64 bit architectures. (No doubt this will improve over time, or perhaps even already has.)"