Hash functions: An empirical comparison

Benchmark program for hash tables and comparison of 15 popular hash functions.

with help and advice from Nils, Ace, and Won

Hash tables are popular data structures for storing key-value pairs. A hash function is used to map the key value (usually a string) to array index. The functions are different from cryptographic hash functions, because they should be much faster and don't need to be resistant to preimage attack. There are two classes of the functions used in hash tables:

  • multiplicative hash functions, which are simple and fast, but have a high number of collisions;
  • more complex functions, which have better quality, but take more time to calculate.

Hash table benchmarks usually include theoretical metrics such as the number of collisions or distribution uniformity (see, for example, hash function comparison in the Red Dragon book). Obviously, you will have a better distribution with more complex functions, so they are winners in these benchmarks.

The question is whether using complex functions gives you a faster program. The complex functions require more operations per one key, so they can be slower. Is the price of collisions high enough to justify the additional operations?

Multiplicative hash functions

Any multiplicative hash function is a special case of the following algorithm:

UINT HashMultiplicative(const CHAR *key, SIZE_T len) {
   UINT hash = INITIAL_VALUE;
   for(UINT i = 0; i < len; ++i)
      hash = M * hash + key[i];
   return hash % TABLE_SIZE;
}

(Sometimes XOR operation is used instead of addition, but it does not make much difference.) The hash functions differ only by values of INITIAL_VALUE and multiplier (M). For example, the popular Bernstein's function uses INITIAL_VALUE of 5381 and M of 33; Kernighan and Ritchie's function uses INITIAL_VALUE of 0 and M of 31.

A multiplicative function works by adding together the letters weighted by powers of multiplier. For example, the hash for the word TONE will be:

INITIAL_VALUE * M^4  +  'T' * M^3  +  'O' * M^2  +  'N' * M  +  'E'

Let's enter several similar strings and watch the output of the functions:

     Bernstein Kernighan
        (M=33)     (M=31)
 too   b88af17     1c154
 top   b88af18     1c155
 tor   b88af1a     1c157
 tpp   b88af39     1c174
a000  7c9312d6    2cd22f
a001  7c9312d7    2cd230
a002  7c9312d8    2cd231
a003  7c9312d9    2cd232
a004  7c9312da    2cd233
a005  7c9312db    2cd234
a006  7c9312dc    2cd235
a007  7c9312dd    2cd236
a008  7c9312de    2cd237
a009  7c9312df    2cd238
a010  7c9312f7    2cd24e
   a     2b606        61
  aa    597727       c20
 aaa   b885c68     17841

Too and top are different in the last letter only. The letter P is the next one after O, so the values of hash function are different by 1 (1c154 and 1c155, b88af17 and b88af18). Ditto for a000..a009.

Now let's compare top with tpp. Their hashes will be:

INITIAL_VALUE * M^3 + 'T' * M^2 + 'O' * M + 'P'
INITIAL_VALUE * M^3 + 'T' * M^2 + 'P' * M + 'P'

The hashes will be different by M * ('P' - 'O') = M. Similarly, when the first letters are different by x, their hashes will be different by x * M^2.

When there are less than 33 possible letters, Bernstein's function will pack them into a number (similar to Radix40 packing scheme). For example, hash table of size 333 will provide perfect hashing (without any collisions) for all three-letter English words written in small letters. In practice, the words are longer and hash tables are smaller, so there will be some collisions (situations when different strings have the same hash value).

If the string is too long to fit into the 32-bit number, the first letters will still affect the value of the hash function, because the multiplication is done modulo 2^32 (in a 32-bit register), and the multiplier is chosen to have no common divisors with 2^32 (in other words, it must be odd), so the bits will not be just shifted away.

There are no exact rules for choosing the multiplier, only some heuristics:

  • the multiplier should be large enough to accommodate most of the possible letters (e.g., 3 or 5 is too small);
  • the multiplier should be fast to calculate with shifts and additions [e.g., 33 * hash can be calculated as (hash « 5) + hash];
  • the multiplier should be odd for the reason explained above;
  • prime numbers are good multipliers.

Complex hash functions

These functions do a good job of mixing together the bits of the source word. The change in one input bit changes a half of the bits in the output (see Avalanche_effect), so the result looks completely random:

     Paul Hsieh One At Time
 too   3ad11d33  3a9fad1e  
 top   78b5a877  4c5dd09a  
 tor   c09e2021  f2aa9d35  
 tpp   3058996d  d5e9e480  
a000   7552599f  ed3859d8  
a001   3cc1d896  fef7fd57  
a002   c6ff5c9b  08a610b3  
a003   dcab7b0c  1a88b478  
a004   780c7202  3621ebaa  
a005   7eb63e3a  47db8f1d  
a006   6b0a7a17  b901717b  
a007   cb5cb1ab  caec1550  
a008   5c2a15c0  e58d4a92  
a009   33339829  f75aee2d  
a010   eb1f336e  bd097a6b  
   a   115ea782  ca2e9442  
  aa   008ad357  7081738e  
 aaa   7dfdc310  ae4f22ec

To achieve this behavior, the hash functions perform a lot of shifts, XORs, and additions. But do we need a complex function? What is faster: tolerating the collisions and resolving them with chaining, or avoiding them with a more complex function?

Test conditions

The benchmark uses separate chaining algorithm for collision resolution. Memory allocation and other "heavy" functions were excluded from the benchmarked code. The RDTSC instruction was used for benchmarking. The test was performed on a Pentium-M processor.

The benchmark inserts some keys in the table, then looks them up in the same order as they were inserted. The test data include:

  • the list of common words from Wiktionary (500 items);
  • the list of Win32 functions from Colorer syntax highlight scheme (1992 items);
  • 500 names from a000 to a499 (imitates the names in auto-generated source code);
  • the list of common words with a long prefix and postfix;
  • all variable names from WordPress 2.3.2 source code in wp-includes folder (1842 names);
  • list of all words in Sonnets by W. Shakespeare (imitates a word counting program; 3228 words).

Results

WordsWin32NumbersPrefixPostfixVariablesShakespeare
Bernstein 144 [135] 889 [478] 92 [500] 325 [116] 320 [121] 658 [391] 885 [646]
K&R 145 [117] 883 [511] 88 [500] 321 [113] 316 [117] 659 [411] 899 [641]
x17 unrolled 135 [99] 842 [491] 73 [24] 303 [107] 298 [113] 636 [434] 855 [638]
x65599 139 [129] 857 [432] 82 [258] 319 [119] 314 [146] 641 [440] 862 [628]
FNV-1a 157 [154] 975 [528] 88 [124] 366 [106] 362 [118] 711 [443] 955 [640]
Sedgewick 153 [126] 963 [477] 84 [48] 366 [113] 361 [119] 703 [404] 947 [627]
Weinberger 169 [125] 1214 [495] 73 [100] 474 [138] 472 [152] 835 [421] 1052 [892]
Paul Larson 139 [117] 880 [470] 64 [16] 324 [119] 318 [117] 651 [421] 864 [656]
Paul Hsieh 157 [130] 853 [476] 107 [138] 282 [120] 275 [113] 677 [399] 991 [676]
One At Time 160 [118] 1018 [481] 99 [131] 383 [120] 376 [125] 745 [422] 990 [601]
lookup3 150 [114] 840 [474] 95 [108] 290 [118] 282 [106] 657 [412] 958 [619]
Arash Partow 154 [120] 1008 [510] 149 [1530] 376 [123] 367 [97] 731 [402] 951 [643]
CRC-32 154 [114] 986 [519] 84 [64] 371 [125] 362 [108] 720 [385] 958 [632]
Ramakrishna 151 [130] 971 [476] 80 [100] 370 [153] 360 [124] 705 [414] 925 [603]
Fletcher 129 [158] 690 [467] 190 [2880] 234 [155] 221 [124] 581 [732] 875 [1453]
Murmur2 140 [122] 794 [469] 84 [119] 262 [133] 257 [131] 633 [445] 874 [654]

Each cell includes the execution time, then the number of collisions in square brackets. Execution time is expressed in thousands of clock cycles (a lower number is better). The 3 best results in each test are highlighted with bold typeface.

The function by Kernighan and Ritchie is from their famous book "The C programming Language", 3rd edition; Weinberger's hash and the hash with multiplier 65599 are from the Red Dragon book. The latter function is used in gawk, sdbm, and other Linux programs. x17 is the function by Peter Kankowski (multiplier = 17; 32 is subtracted from each letter code).

As you can see from the table, the function with the lowest number of collisions is not always the fastest one. For example, compare CRC-32 and Larson's hash in the "numbers" test.

Conclusion

Complex functions by Paul Hsieh and Bob Jenkins are tuned for long keys, such as the ones in postfix and prefix tests. Note that they do not provide the best number of collisions for these tests, but do have the best time, which means that the functions are faster than the others because of loop unrolling. At the same time, they are suboptimal for short keys ("common words" and "Shakespeare" tests).

For a word counting program, a compiler, or another application that typically handles short keys, it's often advantageous to use a simple multiplicative function such as x17 or Larson's hash. However, these functions perform badly on long keys.

Murmur2 is the only of the complex hash functions that provides good performance for all kinds of keys. It can be recommended as a general-purpose hashing function.

Download the source code (80 KB, MSVC++)

Variations

XORing high and low part

For table size less than 2^16, we can improve the quality of hash function by XORing high and low words, so that more letters will be taken into account:

   return hash ^ (hash >> 16);

Subtracting a constant

x17 hash function subtracts a space from each letter to cut off the control characters in the range 0x00..0x1F. If the hash keys are long and contain only Latin letters and numbers, the letters will be less frequently shifted out, and the overall number of collisions will be lower. You can even subtract 'A' when you know that the keys will be only English words.

Using larger multipliers for a compiler

Paul Hsieh noted that large multipliers may provide better results for the hash table in a compiler, because a typical source code contains a lot of one-letter variable names (i, j, s, etc.), and they will collide if the multiplier is less than the number of letters in the alphabet.

The test confirms this assumption: the function by Kernighan & Ritchie (M = 33) has lower number of collisions than x17 (M = 17), but the latter is still faster (see Variables column in the table above).

Setting hash table size to a prime number

A test showed that the number of collisions will usually be lower if you use a prime, but the calculations modulo prime take much more time than the calculations for a power of 2, so this method is impractical. Even replacing division with multiplication by reciprocal values do not help here:

WordsWin32NumbersPrefixPostfixVariablesShakespeare
Bernstein % 2K 145 [261] 880 [889] 426 [8030] 326 [214] 316 [226] 649 [697] 874 [1131]
Bernstein % prime 186 [221] 1049 [995] 445 [5621] 364 [194] 357 [217] 805 [800] 1123 [1051]
Bernstein optimized mod 160 [221] 960 [995] 416 [5621] 341 [194] 334 [217] 722 [800] 969 [1051]
x17 % 2K 137 [193] 847 [1002] 81 [340] 314 [244] 300 [228] 641 [863] 832 [1012]
x17 % prime 173 [256] 1010 [1026] 104 [324] 356 [246] 339 [216] 760 [760] 1046 [1064]
x17 optimized mod 155 [256] 915 [1026] 96 [324] 330 [246] 315 [216] 691 [760] 930 [1064]

Implementing open addressing vs. separate chaining

With open addressing, most hash functions show awkward clustering behavior in "Numbers" test:

Bernst.K&Rx17 unrollx65599FNVUnivWeinb.HsiehOne-atLookup3PartowCRC
OA 426 81 84 207 88 91 273 110 103 92 1042 79
[8030] [20810] [340] [3158] [207] [480] [4360] [342] [267] [205] [20860] [96]
32-bit 179 69 74 114 86 80 125 105 99 92 347 82
[8030] [20810] [340] [3158] [207] [480] [4360] [342] [267] [205] [20860] [96]
chain 92 68 73 82 88 84 73 107 99 95 149 84
[500] [500] [24] [258] [124] [48] [100] [138] [131] [108] [1530] [64]

You can avoid the worst case by using chaining for collision resolution. However, chaining requires more memory for the next item pointers, so the performance improvement does not come for free. A custom memory allocator should be usually written, because calling malloc() for a large number of small structures is suboptimal.

Some implementations (e.g., hash table in Python interpreter) store a full 32-bit hash with the item to speed up the string comparison, but this is less effective than chaining.

Discussion

nils, 2008/01/28 15:32
Hi.

Have you tried to compare the hash-functions against CRC32? That would be interesting!

Some DSP's can already do galois multiplies which is the slow part of CRC. For PC we'll have cheap CRC in the future when the new SSE becomes mainstream (hopefully).

Cool blob btw..
Peter Kankowski, 2008/01/29 03:35
CRC implementation on x86 is slower than the other hash functions (see Paul Hsieh's tests). With SSE4, it should be faster, and it would be interesting to compare them in future. Let's wait for SSE4 :).
ace, 2008/01/30 19:29
As far as I know, the authors of the traditional hash functions that you presented made them under the assumption that the size of the table is a prime, not some "round" number like 1024. They counted on the modulo step to spread the resulting values. So when you use 1024 as the table size, of course their functions don't fare good. Can you try the table sizes that they would use and share these results with us?
Peter Kankowski, 2008/01/31 03:36
Thank you for the idea, I will test it this weekend (and also CRC function that Nils proposed). However, modulus of prime number will be much slower than (hash % 1024), which is optimized to (hash & 1023), so I'm not sure which one will be faster.
Arash Partow, 2008/02/03 00:38
Hi Peter nice article wanted to mention a few things though:

Most of the function presented there produce entropy of some level at a quantity of 32-bits. What you are doing by mod'ing by 1024 is ignoring the 22-bit higher bits. You should integrate them back into the result somehow.

A true test would not quantize the values. Instead it would just make a list of already generated values and see if those generated values occur again within the period. The period being the size of the "common words" etc.

Another good test is to see the avalanching abilities of the functions. In this case you only change 1 bit in the input and then see how many of the output bits change. The average should be close to about half the number of output bits(eg: 16 bits in the case of 32-bit outputs)

Another good thing to remember is to use random values, than only English or some such as you will see that in the case of English only certain bits in a byte are likely to change, how many times do we use control characters in English words?

I believe if you follow the above you may see different results...
acd, 2008/02/03 15:15
Arash, I think that Peter didn't want to investigate "avalanching abilities" or something like that -- his goal was to evaluate the functions on the "good enough" principle in some specific example.
Peter Kankowski, 2008/02/04 14:53
Arash, for multiplicative functions, the tests mix together higher and lower 16 bits with XOR, so higher bits are not ignored.

My purpose was to benchmark the hash functions in close-to-real-life scenario. Theoretical tests will give a completely different result.

For example, cryptographic hashes such as MD5 will give a perfect distribution, but they are impractically slow for hash tables. Good results in a theoretical test not always mean a fast hash function.
Mark Dennehy, 2008/04/02 03:29
Have you reviewed “Performance in Practise of String Hashing Functions” (Jobel & Ramakrishna, 1997) at all? They have some interesting data on performance of various classes of hash functions. I've used their conclusions myself with good results, see here:
http://stochasticgeometry.wordpress.com/2008/03/29/cache-concious-hash-tables/
ace, 2008/02/04 18:33
Nice work.
So, even if the K&R idea is close to the reinvented "simple steps for each character, complex step at the end", the function they published is still not good enough for your "numbers" test, but otherwise is not so bad. And using primes is not so good idea.
But if I remember, the "K&R" code you use is still not the "from the book" one as they don't xor higher and lower bits?
And also, if I properly remember the K&R book, they use what Wikipedia article about "hash table" names "separate chaining". Would you compare how everything behaves then? Of course, for good results, you shouldn't allocate each new node with a separate malloc call.
Peter Kankowski, 2008/02/05 14:29
You have a good memory :). K&R don't XOR the lower and higher halves, but I tested both versions, and XORing was usually faster because of a lower number of collisions:

                    Bernstein:    K&R:           x17:          x65599:        FNV-1a:
Words without XOR   151 [  308]   147 [  196]    140 [  240]   141 [  293]    152 [  272]
Words with XOR      147 [  261]   142 [  194]    137 [  193]   139 [  250]    151 [  262]

Win32 without XOR   893 [ 1087]   892 [  932]    839 [  880]   866 [ 1049]    963 [ 1076]
with XOR            889 [  889]   897 [ 1006]    847 [ 1002]   843 [  816]    951 [ 1021]

Numbers without XOR 825 [18330]   869 [20810]    76 [  360]    298 [ 5400]    91 [  409]
with XOR            453 [ 8030]   842 [19533]    81 [  340]    205 [ 3158]    88 [  207]

Prefix without XOR  333 [ 262 ]   329 [  192]    316 [  235]   329 [  251]    374 [  260]
with XOR            328 [  214]   339 [  274]    314 [  244]   319 [  200]    368 [  233]

Postfix without XOR 320 [262  ]   320 [  234]    304 [  237]   318 [  265]    355 [  211]
with XOR            316 [  226]   323 [  256]    300 [  228]   316 [  316]    356 [ 237 ]

Variables without X 671 [  948]   668 [  762]    654 [ 1109]   642 [  883]    697 [  787]
with XOR            658 [  697]   667 [  811]    641 [  863]   641 [  904]    693 [  790]

Shakespeare without 882 [ 1151]   915 [ 1168]    844 [ 1019]   837 [ 1135]    890 [  982]
with XOR            884 [ 1131]   893 [ 1188]    832 [ 1012]   838 [ 1111]    908 [ 1063]


For K&R function, the version without XOR seems to be better, so I will use it in future tests. FNV authors recommend using XOR, though their function often works better without it. XORing also does not help complex functions, because they already mix the bits.

The choice between separate chaining and open addressing depends on your task. If you need add-only hash table (a symbols table in compiler, a word counting program, etc.), open addressing will be faster because of better data locality for caching.

If you need to delete items from the table, separate chaining may be faster. Or may be not: note that in Python dictionary, they use some complex variation of open addressing, not separate chaining (see Beautiful code book). I will not do the tests, because the results will depend on the usage pattern of a particular application (how often does it delete items? which items? how many times is the table resized? etc.) You should benchmark it on some specific task.
ace, 2008/02/06 15:43
I expect much less collisions and much better behaviour with separate chaining in the "numbers" test with the K&R function, that's why I asked.
I always used separate chaining for my hashes. Unless you know at the start how big hash you need, separate chaining behaves better.
ace, 2008/02/06 21:04
Comparing K&R and x17 using efficiently implemented "separate chaining", for all your input sets I get:

   Kernighan&Ritchie: |       179 [  234]
                 x17: |       179 [  198]

   Kernighan&Ritchie: |       971 [ 1022]
                 x17: |       971 [  982]

   Kernighan&Ritchie: |       108 [ 1000]
                 x17: |        84 [   48]

   Kernighan&Ritchie: |       311 [  226]
                 x17: |       310 [  214]

   Kernighan&Ritchie: |       299 [  234]
                 x17: |       299 [  226]

   Kernighan&Ritchie: |       773 [  822]
                 x17: |       782 [  868]

   Kernighan&Ritchie: |      1116 [ 1282]
                 x17: |      1137 [ 1276]


As expected, x17 is noticeably faster only in your "Numbers" test set. Otherwise the numbers of collisions are very similar. Also note that the pathological behaviour is fully avoided.
ace, 2008/02/06 21:09
In the previous message, the missing numbers eaten by the blogging software on the right from 299s are:

  234
  226

ace, 2008/02/06 21:15
For the reference, the results of your original code on my machine:
   Kernighan&Ritchie: |       186 [  196 ]
                 x17: |       183 [  193 ]

   Kernighan&Ritchie: |       987 [  932 ]
                 x17: |       992 [ 1002 ]

   Kernighan&Ritchie: |      1056 [20810 ]
                 x17: |       109 [  340 ]

   Kernighan&Ritchie: |       321 [  192 ]
                 x17: |       329 [  244 ]

   Kernighan&Ritchie: |       304 [  234 ]
                 x17: |       304 [  228 ]

   Kernighan&Ritchie: |       793 [  762 ]
                 x17: |       801 [  863 ]

   Kernighan&Ritchie: |      1153 [ 1168 ]
                 x17: |      1132 [ 1012 ]

Peter Kankowski, 2008/02/07 03:56
Thank you very much! It looks to be faster than my program. May I ask what table size did you use? How was the memory allocated?

Could you please share your code via an upload site (such as RapidShare.de or Zalil.ru)?
ace, 2008/02/07 11:34
I kept the table sizes the same as you did. As I've already said, the malloc was not called.

Now, your code depends a lot on knowing exactly how many strings it has to process before it determines the table size. And if we use more elements for the same table size my program will run circles around yours.

I've checked now the wikipedia entry I've mentioned (the first time I've just looked up how they call "chaining" and didn't read anything) and I can just tell you -- don't believe them when they write:

"in most cases the differences between these algorithms is marginal, and other considerations typically come into play." ( false )

It looks to me that the author of wikipedia article "invented" that conclusion (that's against the wikipedia policy, but wikipedia shouldn't be used as authoritative reference anyway).

It seems that your statements "The choice between separate chaining and open addressing depends on your task. If you need add-only hash table (a symbols table in compiler, a word counting program, etc.), open addressing will be faster because of better data locality for caching" are also based on that wikipedia article and not on your actual or thought experiments.

Properly implemented, chaining is always much faster in "harder" cases and has comparable performance otherwise. Using it is more important then trying to be smart with the hash function. And of course, in practice there's seldom any benefit of using anything more complicated than the K&R-style function -- in that aspect you were on the right track. Only for very long strings, there can be benefit of processing more bytes at once during hash index calculation. In any real use, the time spent in hash index calculation is not what makes any program slow, not using chains often is (and of course, using mallocs where they are not needed alwasy slows things down).

I'm sorry that I can't upload the code, but it's easy to repeat the results. I've counted the "collisions" any time the string comparison to equality failed.
Peter Kankowski, 2008/02/07 16:08
For fair comparison, table size should be smaller for separate chaining.

Separate chaining needs more memory because of the next item pointers (usually 3 times more than open addressing with the same table size), so people reduce the size of the table to save space. Sedgewick recommends setting table size to 1/5..1/10 of the number of items (on average, 5..10 items in each chain). For open addressing, he uses table size = 2..4 * N.

If you make a sparser table, the search will be faster (a classical time-space compromise), and it's true for both methods. Open addressing would also be faster if you take a larger table. I would use not 1/5..1/10, but somewhere around 1/2..1 for separate chaining. Memory consumption should be the same, so it will be fair.

I've counted the "collisions" any time the string comparison to equality failed.
You probably counted them twice (when inserting into the table and when searching in it). That's why your figures for collisions should not be compared with mine.

...based on that wikipedia article and not on your actual or thought experiments
My statements are based on the understanding of computer architecture and caching. Open addressing requires less memory accesses. If the string comparison fails, it reads the next item, usually from the same cache line. Separate chaining reads from non-adjacent cache lines.

There is another way to reduce the number of memory accesses (used in Python dictionary implementation). Let's store a full 32-bit hash in the hash table cells. It would require twice more memory, but string comparison will be rarely needed (only when the hashes are equal). I will probably try this.

About prime numbers: I just found in Sedgewick's book that they were used in modular hashing (not to be confused with multiplicative hashing). "Before high-level languages appeared" (i.e., long before your and my birth :), they used something like this:


UINT hash(const CHAR* str) {
return *(UINT*)str % TABLE_SIZE;
}


Here, TABLE_SIZE should not be the power of 2, or else it will mask out the letters in higher bytes. Today this function is not very useful, because division is much slower than other operations on modern processors. It also will show terrible results in the "Prefix" test.

What is more important is that your program with separate chaining eliminates the worst-case ("Numbers" test). It's really good.
nils, 2008/02/07 18:30
> Here, TABLE_SIZE should not be the power of
> 2, or else it will mask out the letters in
> higher bytes. Today this function is not very
> useful, because division is much slower than
> other operations on modern processors. It
> also will show terrible results in
> the "Prefix" test.

Wait wait wait, guys..

No compiler does a slow division to get the modulo of a constant these days. No matter if your constant is a prime or a power of two.

Two multiplications and some adds and subs is all you need for any number. Powers of two just need an logical and (as everyone here knows).

So using a modulo is not *that* much slower. I think memory access times are much more important.


Besides that: The number of collisions become very important if the data you hash is large. For things like filenames you can get away with a simple hash, but try to hash megabyte large blobs of data, possible with paging from disk.

You'll be glad about any collision you don#t have in these cases.
ace, 2008/02/07 19:00
> usually 3 times more than open addressing with the same table size

3 times more? Of what? Certainly not in my code. If you can't imagine it without actually programming, then try to implement it efficiently -- you'll see then.


> Memory consumption should be the same, so it will be fair.

In my view, to be fair

1) Try to fix the table size before you know how many elements you have to hash (you have to do this for almost any real life purpose). With open addr, you will either use insane amounts of memory "just to be sure" or will have to make new tables each time you cross some percentage of table occupancy. That percentage is important, since the rest of the pointers is going to be unused. And making a new table on these points is also expensive (you have to rehash everything again).

2) Forget about nice power of two table sizes -- once you care about the memory you can't allow the luxury of having each table size twice as big as previous.

Everything else is not realistic enough.


> You probably counted them twice (when inserting into the table and when searching in it)

You're right. So you count "collision" only after inserting? Moreover, should "collision" be every cmp or only "did it hit empty table entry or not"? Personally I'd avoid the term "collision" at all -- it would be better to count some specific operations which have some specific costs (cmp, hash calc etc). Or if we observe "table occupancy" it's also not the full information there once the chaining exists. Etc. More or less boring things -- personally I wouldn't mix the speed tests with the evaluations of the properties of the hash functions or the program strategies.


> Separate chaining reads from non-adjacent cache lines.

What makes you think that?


> About prime numbers: (...) they were used in modular hashing (..) "Before high-level languages appeared"

Exactly! That's why people who knew about that even later tried to keep the table sizes uneven. However I still stumble on the texts that preach the major importance of primality (and I never believed them). On another side, most of the times you don't have the luxury of having hash tables with only of power of 2 sizes, so the "modulo" calculation is hard to be avoided in practice -- that's why I liked to see it in the tests. :) As the added benefit, it demonstrated that dependence of hashes on primes is not really needed.

Btw, I've just checked K&R 2ed to see their example, now you know the whole story behind their

#define HASHSIZE 101


line. :)

(speaking of defines, in your cpp code you should use enums instead)
ace, 2008/02/07 19:06
nils, you're right, once the access to the hashed elements is costly, the investment in the hash function with minimal collisions is the best way to go.
Re modulo operation, as far as I know, it can't be avoided if the the table size is not known in advance.
ace, 2008/02/07 20:27
Chaining, table size 1/2 of that from the "open addr" code, counting "collisions" only in 2nd phase:

Numbers / 500 lines / 512 elements in the table
           Bernstein:        109 [  500 ]
   Kernighan&Ritchie:        111 [  500 ]
                 x17:        101 [  168 ]
              x65599:        134 [  780 ]


Orig code (no chaining, table size twice as big as in chaining):

Numbers / 500 lines / 1024 elements in the table
           Bernstein:        516 [ 8030 ]
   Kernighan&Ritchie:       1056 [20810 ]
                 x17:        109 [  340 ]
              x65599:        262 [ 3158 ]

Peter Kankowski, 2008/02/08 04:53
Nils, thank you for the idea. I did not know about possible optimization for modulus. Just tried some constants on MSVC++ 2005 (full optimization, favor fast code):

; UINT j = i % 257;
mov  eax, 0xff00ff01
mul  esi
shr  edx, 8
imul edx, 257
mov  ecx, esi
sub  ecx, edx

; UINT j = i % 127;
mov	 eax, 0x02040811
mul	 esi
mov	 eax, esi
sub	 eax, edx
shr	 eax, 1
add	 eax, edx
shr	 eax, 6
imul	 eax, 127
mov	 ecx, esi
sub	 ecx, eax


It's equivalent to j = i - i / 257 * 257, where i / 257 is calculated with MUL. Unfortunately, this optimization is impossible if table size is not constant.

There is also a method for optimizing modulo of 2^N-1 (Mersenne primes). Note that his code is wrong for large numbers. The correct, slower code will be:


UINT mod127(UINT k) {
const UINT p = 127, s = 7;
do {
k = (k & p) + (k >> s);
} while (k > p);
return k == p ? 0 : k;
}


You are right about the case when there is a lot of data. Paul Hsieh said his function is used by Adobe, Apple, and Google. As far as I understand, they use it for large amount of data (such as Google web search index). Complex functions perform better in this case, and his function is one the best for this kind of tasks.
Peter Kankowski, 2008/02/08 06:15
> 3 times more?

I'm sorry, not 3 times, but N additional pointers for the next items, where N is the number of strings inserted into the table.

About non-adjacent cache lines. With separate chaining, you need to look up the hash value in the index table, then walk to the key-value pairs (see the picture). They are usually located far from each other in memory. If the value is not found in the first record, you need to follow the next pointer, which also can be far from the first one.

Now imagine open addressing with stored 32-bit hash. In case of collision, it will read from adjacent memory cells (usually, no need for string compare), so it may be faster than other methods. I will test this.

A hybrid scheme looks interesting: items in the collision chain will be close to each other, so it should be fast. Another scheme.


> And making a new table on these points is also expensive (you have to rehash everything again).
Resizing hash tables is another interesting topic. For now, let's assume that we know the approximate table size before starting hashing. For example, in a word counting program you could divide the size of the text file by the average word length to get the estimated size.

Thank you for the results; they are really impressive. Why can't you share your code? Zalil.ru is really easy to use (even with no knowledge of Russian language): just choose your file. It will upload it and give you a link on their server. RapidShare is another popular server.
ace, 2008/02/08 11:30
> About non-adjacent cache lines. With separate chaining (...)

I still claim you didin't try to imagine how it should be implemented.

> A hybrid scheme

Just another bad wikipedia article with unsubstantiated claims, written by the author who doesn't understand the topic, or maybe just edited to stupidity by the following "contributors" -- I'm not interested to check how it happened.

> Another scheme

This at least doesn't fail the "first glance test".

> For example, in a word counting program you could divide the size of the text file by the average word length to get the estimated size.

The perl script is 25 lines, the input files are 5 GB each. Estimate how big hashes should be constructed during the interpretation of the script on a 256 MB machine.

> Why can't you share your code?

My code is not important. I know you were able to understand everything (even without actually programming!) if you didn't stick to your initial idea and code.
Nils, 2008/02/08 13:21
Hi Peter,

You can simply calculate the fast modulo using reciprocals yourself. No need to rely on the compiler optimization.

The code to find the magic constants for reciprocal multiplication can be found here: http://www.hackersdelight.org


The hashtable size does not change that often, so it's not a big deal.

Btw, I may be wrong, but I remember that compilers generate better code for the optimized division when you work with unsigned values for the index and the prime-constant.
Arash Partow, 2008/02/09 08:02
Hi Peter,

Perhaps you could use the following code:



#include
#include
#include
#include
#include
#include

template
inline bool read_into_sequence(const std::string& file_name, Sequence& buffer)
{
std::ifstream file(file_name.c_str());
if (!file) return false;
std::string line;
while (std::getline(file,line))
{
buffer.push_back(line);
}
file.close();
}

unsigned int ap_hash(const std::string& str)
{
unsigned int hash = 0xAAAAAAAA;

for(std::size_t i = 0; i < str.length(); i++)
{
hash ^= ((i & 1) == 0) ? ( (hash << 7) ^ str[i] * (hash >> 3)) :
(~((hash << 11) + str[i] ^ (hash >> 5)));
}
return hash;
}

template
void collision_test(const StrSequence& str_list,
const HashFunction& function,
const IntSequence& quantizer_list)
{
typedef std::map hash_value_list_type;
std::vector hvl(quantizer_list.size());
for(StrSequence::const_iterator it = str_list.begin(); it != str_list.end(); ++it)
{
unsigned int hvlit = 0;
for(IntSequence::const_iterator qit = quantizer_list.begin(); qit != quantizer_list.end(); ++qit, ++hvlit)
{
unsigned int hash_value = function(*it) % (*qit);
hash_value_list_type::iterator hash_it = hvl[hvlit].find(hash_value);
if (hash_it != hvl[hvlit].end())
hash_it->second++;
else
hvl[hvlit].insert(std::make_pair(hash_value,1));
}
}

for(std::vector::iterator hvlit = hvl.begin(); hvlit != hvl.end(); ++hvlit)
{
unsigned int total_collisions = 0;
for(hash_value_list_type::iterator it = (*hvlit).begin(); it != (*hvlit).end(); ++it)
{
if (it->second > 1)
{
total_collisions += (it->second - 1);
}
}
std::cout << total_collisions << "\t";
}
std::cout << std::endl;
}

int main()
{
std::vector str_list;

if (read_into_sequence("words.txt", str_list))
{
std::cout << "ERROR - Could not read list!" << std::endl;
return 1;
}

std::vector quantizer_list;
quantizer_list.push_back(std::numeric_limits::max());

quantizer_list.push_back(256); quantizer_list.push_back(512);
quantizer_list.push_back(1024); quantizer_list.push_back(2048);
quantizer_list.push_back(4096); quantizer_list.push_back(7477);
quantizer_list.push_back(8011); quantizer_list.push_back(8192);
quantizer_list.push_back(8329); quantizer_list.push_back(9059);
quantizer_list.push_back(16384); quantizer_list.push_back(32768);
quantizer_list.push_back(65536); quantizer_list.push_back(131072);
quantizer_list.push_back(262144); quantizer_list.push_back(469207);
quantizer_list.push_back(524288); quantizer_list.push_back(544771);
quantizer_list.push_back(711209); quantizer_list.push_back(902677);
quantizer_list.push_back(1048576); quantizer_list.push_back(1299289);
quantizer_list.push_back(2097152); quantizer_list.push_back(4194304);
quantizer_list.push_back(8388608); quantizer_list.push_back(16777216);

collision_test(str_list,ap_hash,quantizer_list);

return 0;
}

Peter Kankowski, 2008/02/09 16:33
Thank you again, Nils. As Warren says in his book, unsigned values require additional operations. In x86 ISA, there is no SHRXI instruction that he uses, so we have to use a longer code with 2 SHRs.

When you first said about modulo optimization, I thought that it's possible to generate the code on the fly. In this case, we can use faster sequences for some divisors instead of general and slower code, but code generation can slow down the whole program. Another solution is to select one of 32 pre-generated functions for prime divisors near to 2^K (one function for each possible value of K).

In most cases, using prime numbers provide similar number of collisions to that of 2^K table sizes (see the table above), so I'm not sure if using these tricks worth a cost. It will not be much faster than the original program, but it's interesting and I will possibly try this. Thank you very much for your contribution.
Peter Kankowski, 2008/02/09 16:39
Ace, please show your code; it will explain everything. I just don't understand your idea.

> the input files are 5 GB each. Estimate how big hashes should be constructed...
I agree with you, it should have some reasonable limit on hash table size. However, the basic idea is correct. It makes difference if you count words in a 20 KB article or in a multi-megabyte book such as "War and Peace". You can use the file size to estimate the size of the table and avoid resizing it too often.
Peter Kankowski, 2008/02/09 17:00
Arash, thank you for following the discussion. Your code is good for evaluating the number of collisions, but it's not suitable for performance comparison. I just use a different method for comparisons, please try to understand it.

Your function shows mediocre speed in my tests because of using ((i & 1) == 0) condition that was not optimized to branchless code by the compiler. Weinberger's function has the same problem (on old computers, the cost of branch misprediction was lower, so I guess his function was quite fast for his time). On modern computers, branches on random data should be avoided in inner loops whenever possible.

In other aspects (the number of collisions) your function is good and shows quite nice results. You should try to avoid the branch inside the loop.
ace, 2008/02/10 00:22
Peter, consider differencies of open addressing in your implementation (OA) vs chaining (C) for 500 strings. OA: 1024 pointers in the table and the strings. C: 512 pointers in the table. You already answered how many pointers more are needed. Now look at the results from OA and C -- everything is there. The OA needs more comparisons than C. It's obvious why -- even the first collision in OA spoils some other entry. The results show how that progresses. In C, all entries of one index value don't influence another. That's why C is better.

Finally, this "cache-hits bla bla bla" what somebody wrote and you stuck to is a nonsense, as I've already said. Check what happens in OA collisions: the code can't just play with pointers, it must do string comparisons, so OA in that case is not better.

Just start implementing C.
Arash Partow, 2008/02/10 03:41
Hi Peter,

Thanx for the comments and advice on my function, I'm aware of the speed issues in the published versions the only reason why it and other hash function on my site are described like that is for explanatory purposes. In fact most of the hash functions on the page can be speeded up trivially by unrolling the loops somewhat.

For example when I use my hash function in real life I remove the if statement and process 2 or 4 or 8 bytes at a time rather than the 1 byte per loop. In any case a good optimizing compiler should see that the branching statement is not related to the data but the iterator and from there should be able to come up with a decent execution path. An example of how this optimization can be done:


unsigned int ap_hash(const char* begin, const char* end)
{
unsigned int hash = 0xAAAAAAAA;
unsigned int length = static_cast(end - begin);
unsigned int rounds = length / 2;
const char* it = begin;
for(std::size_t r = 0; r < rounds; ++r)
{
hash ^= (hash << 7) ^ (*it++) * (hash >> 3);
hash ^= ~((hash << 11) + (*it++) ^ (hash >> 5));
}
if (1 == (length & 0x01))
{
hash ^= (hash << 7) ^ (*it) * (hash >> 3);
}
return hash;
}


Note: the above can be done to make of the hash function presented.

That said a purely "collisions oriented" hash function test suite would be great to have, I'd be willing to collaborate with you and others here to come up with a set of tests (BJs or PHs) to definitively test hash functions.

The reason why I say this is from experience, good hash functions from a collisions pov are easier to optimize for speed than good hash functions from a speed pov are upgraded to perform well from a minimizing collisions pov.
ace, 2008/02/10 12:34
Arash,

> In any case a good optimizing compiler should see that the branching statement is not related to the data but the iterator and from there should be able to come up with a decent execution path.

Peter uses MS compiler. I've tried with gcc 4, and it also can't make such optimization. Which compiler can?
Peter Kankowski, 2008/02/10 14:24
Ace, thank you, I will do the version with chaining.
Arash Partow, 2008/02/10 20:10
Ace,

>>Which compiler can?
I'm not sure if any compilers do at this point, that is why I reorganize things. The point I was trying to make was that one-at-a-time hash's should have their loops unrolled (where possible).
Peter Kankowski, 2008/02/11 02:26
Arash, with this optimization, your function looks much better. I will include it in the next benchmark.

About the test suite. What do you think about finding the average collision chain length in addition to the number of collisions?

There are probably other valuable metrics. I'm not so good at math aspects of hash functions (programming always was more interesting for me :), so if you have other ideas, please propose them. You seems to have more experience in math than me. BTW, what are BJs and PHs?

> The point I was trying to make was that one-at-a-time hash's should have their loops unrolled (where possible).
Thank you, I will try to unroll other functions, too.
ace, 2008/02/11 12:37
Peter, "average chain length" hides "maximal chain length" which can in OA have bigger impact. I would simply count the number of hash calls and the number of string comparisons for each algorithm, for different percentage of occupancy of slots (that's what you don't do now) and different input sets.

You did a good job identifying a good examples for weaknesses of OA with linear probing.

Regarding investigation of hash functions and algoritghm properties independently of speed, did you know you can do integer aritmetic in Perl just like in C?

sub myh
{
	use integer; # important
	my $h = 0;
	for my $i ( 0..9 ) {
		$h = $h * 33 + $i;
		my $v = $h & 0x1ff;
		print "$i\t$v\n";
	}
}

myh();


gives

0	0
1	1
2	35
3	134
4	330
5	143
6	117
7	284
8	164
9	301

ace, 2008/02/11 12:44
Arash,

>>> In any case a good optimizing compiler should
>> Which compiler can?
> I'm not sure if any compilers do at this point,

Thanks. Which one do you use? I guess MS?

Anybody to try Intel?
nils, 2008/02/11 16:38
I could run a GCC 4.2.2 on cygwin this evening.
ace, 2008/02/11 17:13
nils, thanks for possible 4.2.x test.

When I said Intel I was referring to the "Intel Compiler" (which is distributed with some licence managing -- limiting software even for Linux).

I've tried the latest gcc from 4.1 branch, and Peter uses VS 2005, but as far as I've seen up to now VS 2008 doesn't introduce some significant optimizations on the function level, that's why I asked about Intel Compiler.

Still, it is possible that gcc 4.2 has some new optimizations (I haven't analyzed what's new in 4.2), so it's worth that you try it.

But my question on that topic is still -- is there anybody to try Arash's original code on the Intel Compiler to see if the bit check of i can be moved outside of the loop by compiler alone?
Arash Partow, 2008/02/11 22:01
Hi Peter, Ace,

>>Arash, with this optimization, your function looks much better. I will include it in the next benchmark.
Can't wait to see the new results. :)


>>About the test suite. What do you think about finding the average collision chain length in addition to the number of collisions?

Average collision analysis is a moot point with regards to hash function, due to two reasons, a hash function is inherently a prng so its behavior is best described as a poisson process (when was the last time someone wanted to know the average value the mersenne twister gives?), and secondly average collisions would require a quantizer value - this as you can see is another debatable issue, what value to use? should it be prime or can it be a power of 2(for efficient "and"ing etc)?

As Ace suggested the maximal chain length would be a nice fact to know and a good measure. Knowing the mean chain length, and the varaince from the mean then assessing how close that is to the maximal would be another to know but these should be done with and without quantizers.

With regards BJ and PH, Bob Jenkins and Paul Hsieh, they both advocate analysis of avalanching properties - which I agree is important and does reveal a lot more than basic collision analysis. In short coming back to the prng model for hash functions, given some random string of random length, and a hash function that outputs n-bits, a good hash function as far as avalanching goes (not mentioning strict avalanching criteria) is one that has probability of the i'th bit in the output being 1 close to 0.5. A hash function that has some i'th output bit that is predictable eg: always 1 or most times 1 implies that the output bit is either not being included properly in the mixing process or that whatever mixing is occurring results in it being in said way, a simplistic example of this is if the result of a particular bit is the output of an "OR" gate with inverted inputs (C = A or ~A)


>> Which one do you use? I guess MS?

I use Intel (icc) 10.x, and it doesn't do it. But there is a good reasons why, as far as the c++ standard is concerned there are some issue which regards to reference aliasing that the optimizer in the compiler detects, and as a result is not capable of optimizing away. It may change in the 2009 standard ratification if threading is introduced as according the Sutter for threads to be apart of the language definition the language should provide some guarantees about memory model and some-such this would inherently result in positive side effects for such optimizations.

In some cases I find it better to craft the code specifically rather than relying on some black-box of a compiler's optimizer, you may write something and it may look good in contrived test cases, but then when used elsewhere the compiler may see other possible optimizations and as a result do "other" things. I guess that is why even today you still see people implementing data structures such as RB-Trees etc purely as macros.
peter_k, 2008/02/13 06:31
> Nils, thank you for the idea. I did not know about possible optimization for modulus. Just tried some constants on MSVC++ 2005 (full optimization, favor fast code):

Hmm, i've wrote about it in your blog before :-]
http://smallcode.weblogs.us/2007/01/05/what-your-compiler-can-do-for-you/
In most cases divide and modulo (together) can be done using only one MUL
Peter Kankowski, 2008/02/13 16:20
> Knowing the mean chain length, and the varaince from the mean then assessing how close that is to the maximal.
Arash, thank you very much for these ideas. If I will have some free time, I will implement them in my tests. You have a good framework for theoretical tests, so you could try them, too.

Prime quantizers are not needed for hashing strings; ANDing will do better.

About avalancing: multiplicative functions are not random at all, but they often show better results than PRNG-like functions.

Take a look at Python sources. They use a simple multiplicative hash function with M = 1000003 (search for "string_hash" in file stringobject.c). In dictobject.c, they say:

Most hash schemes depend on having a good hash function, in the sense of simulating randomness. Python doesn't, [...] but this isn't necessarily bad.

Read also their notes on optimizing dictionaries.

Thank you again for your help. Your function shows much better results after unrolling, and mine also wins from it :).

> i've wrote about it in your blog before
Peter, sorry, I forgot about this. I knew about "magic numbers" for division, but never thought that it can be applied to modulo. BTW, you will need additional shifts anyway, so it's not only one MUL.
ace, 2008/02/12 20:26
> The question remains, should you look for a better hash function or rely on chaining to fix the worst-case behavior.

Chaining can be used where OH can't. So when implementing only one strategy, not implementing chaining is the wrong choice. And once there's chaining, there's no real reason to implement OH. Second, with chaining, most hash functions will have the same effect -- the difference is hard to recognize in most of the cases. But, where the number of collisions is more important than speed (where access to the data is really expensive), not surprisingly, CRC32 proved to be the very safe bet. And for tables with up to 2^16 elements CRC16 can probably be enough.

Now, Peter, your function gives quite consistent good results in speed and number of collisions, at least in your set of tests! It's amazing that 17, even smaller constant than 33 and 31 went unrecognized up to now. Congratulations! Of course, this will not give significant speed up in practice (unless in cases where somebody unwisely used OH), where widely used functions all have the similar speeds.

But you also nicely demonstrated that the effort to develop the complicated functions (excluding CRC of course) was practically unneeded.

Here is some kind of "overall speed scores" based on the normalized results of the chaining tests with your code (I gave all test groups the same weight, that is maybe controversial) with

g_table_size_mask = NextPowerOfTwo(i) - 1;


that is, the table twice as small as in your OH tests.

(In chaining, only one pointer per string can be enough. The string can simply immediately follow the "next" pointer.)

Compiled with /O1

1.00 x17 unrolled
1.02 x17
1.04 Kernighan&Ritchie
1.06 Bernstein
1.08 x65599
1.13 Paul Hsieh
1.13 FNV-1a
1.14 universal
1.20 CRC-32
1.26 One At Time
1.34 Arash Partow
1.47 Weinberger
11.62 lookup3


(lookup3 is the only one really dangerous here -- not to be used unless it can be checked that the compiler optimized what was necessary to optimize to even use the function)

Compiled with /O2

1.00 x17 unrolled
1.00 x65599
1.00 x17
1.05 lookup3
1.07 universal
1.08 Paul Hsieh
1.09 FNV-1a
1.10 Kernighan&Ritchie
1.11 Bernstein
1.16 CRC-32
1.23 One At Time
1.32 Arash Partow
1.35 Weinberger


Other common cases not covered by your tests are hashing of integers, and hashing of memory addresses. In both cases there is a limited number of bytes, some patterns occur often and there's still the need to hash fast and good.
Peter Kankowski, 2008/02/13 15:51
Yes, chaining seems to be better choice than OA. I'm afraid that my function also may show bad results on some data set, and chaining really helps in such cases.

My function differs from the others not only by multiplier, but also by subtracting 0x20 from each character, which helps to "pack" more characters in the hash value. Certainly, this will work only for ASCII strings (you should use a different function for random binary data).

Arash said a lot about PRNG-type hash functions (with avalanching effect), but these functions are actually less effective for short strings than non-random multiplicative functions. If the table is small enough, a multiplicative function can pack all input characters into integer, so the hash value will be unique.

Thank you for the overall score figures and especially for lookup3 test. This function is too complicated for both compilers and humans :).

Do you mean hashing 32-bit integers and memory addresses? In this case, different hash functions must be used. Modulo prime should work well, and the optimization proposed by Nils will be very useful.
ace, 2008/02/13 17:02
> but also by subtracting 0x20 from each character

Many years ago, at the time I considered such a modification, I believed it shouldn't bring anything. However I never used OH or searched for pathological examples. Have you tried to get the results without that subtraction?

> Thank you for the overall score figures

That was on Core 2. I've made the same scoring on Pentium III, /O2 and received:

1.00 x17
1.02 x17 unrolled
1.07 lookup3
1.08 x65599
1.08 Kernighan&Ritchie
1.10 Paul Hsieh
1.13 Bernstein
1.15 universal
1.16 CRC-32
1.19 FNV-1a
1.27 One At Time
1.34 Arash Partow
1.38 Weinberger


Note that here a function with 1.16 is on average 16% slower than the best one. Of couse a bit of reasonable doubt remains as you selected the test sets. :)

In practice the hash functions are called between other processing, so the speed of actual hash functions can contribute even less to the overall speed. In such cases some shorter function can be better even if in tests like yours it appears slower than some with the bigger footprint. On another side, when we really want to minimize the number of collisions, it would be interesting to know "collison goodness factor" from your tests -- using the same method I get the following scores:

1.00 x17
1.00 x17 unrolled
1.03 CRC-32
1.05 lookup3
1.08 universal
1.11 Paul Hsieh
1.12 FNV-1a
1.13 One At Time
1.35 x65599
1.35 Bernstein
1.36 Weinberger
1.36 Kernighan&Ritchie
2.34 Arash Partow


Now, I also considered the possibility that your function gets too good score here since you selected the pathological example and possibly optimized your function only for that. So I removed the "numbers" results and the scores are:

1.00 lookup3
1.01 x17
1.01 x17 unrolled
1.03 CRC-32
1.05 universal
1.05 Paul Hsieh
1.06 Bernstein
1.06 x65599
1.06 FNV-1a
1.06 Kernighan&Ritchie
1.06 One At Time
1.07 Arash Partow
1.17 Weinberger


Note that most of the functions are very close. Not counting 'Weinberger', all inside of 7% span!

(Still, I'd rather count the number of comparisons and not "collisions".)

> Modulo prime should work well, and the optimization proposed by Nils will be very useful.

How much is the gain of using these compared to a single DIV in processor cycles? Can that method be implemented to work for any N? Can it be programmed to work without programmer manually testing constants (class FasterDiv) where the constants are calculated once (but in the runtime) and then often used?
Peter Kankowski, 2008/02/14 16:47
> Have you tried to get the results without that subtraction?

It was almost always slower:
               Words       Win32      Numbers    Prefix     Postfix   Variables Shakespeare
x17            146[237]   890[980]    85 [351]   339[277]   320[234]  684[1196]   875[1059]
x17 - 0x20     138[193]  848[1002]    81 [340]   313[244]   300[228]   640[863]   839[1012]
K&R            143[196]   892[932]  866[20810]   329[192]   320[234]   658[762]   878[1168]
K&R - 0x20     136[187]   842[917]  863[20810]   312[201]   302[207]   636[852]   840[1050]
Bernst.        145[261]   880[889]   426[8030]   326[214]   316[226]   649[697]   874[1131]
Bernst. - 0x20 139[229]   850[903]   468[9295]   319[276]   303[263]   640[756]   841[1025]


> How much is the gain of using these compared to a single DIV in processor cycles?

           % 1031 and 257     % 127
DIV            39                39
mod_table_size 14                17


All times are in clock cycles; measured with Agner Fog's program.


> Can that method be implemented to work for any N?

Some divisors require 4 additional instructions (compare 257 and 127). If you have a pre-calculated table of divisors close to the powers of 2, you can select only the divisors that give a shorter code. If it gives a longer code, just select the next prime number, e.g., 131 instead of 127.

Another solution is to use CMOV to ignore the result of the 4 additional operations when they are not needed, but this will be a little slower.
ace, 2008/02/14 17:44
>> Have you tried to get the results without that subtraction?
> It was almost always slower

I've asked because the method of using multiplication by constant is widely accepted but using subtraction not.

The "collision score tables" which include minus and no minus versions for various mult factors and using chaining look inconclusive:

Without numbers test

1.00 lookup3
1.01 x17
1.03 CRC-32
1.04 KandRM
1.04 universal
1.05 x65599
1.05 Paul Hsieh
1.06 Bernstein
1.06 Kernighan&Ritchie
1.06 FNV-1a
1.06 One At Time
1.07 x65599M
1.07 BernsteinM
1.10 X17NoM
1.16 Arash Partow
1.17 Weinberger


with numbers test

1.00 x17
1.00 x17 unrolled
1.03 CRC-32
1.05 lookup3
1.08 X17NoM
1.08 universal
1.11 Paul Hsieh
1.12 FNV-1a
1.13 One At Time
1.34 KandRM
1.34 x65599
1.35 Bernstein
1.35 Kernighan&Ritchie
1.35 Weinberger
1.37 BernsteinM
1.74 x65599M
8.97 Arash Partow


It looks to me that the visible "noise" of the results can not actually lead to some useful conclusion. If that operation really has the properties of conditioning the final result better even for these test cases, I'd expect that every function based on the multiplication to benefit from it? Here, K&R got better with subtraction, x65599 and Bernstein got worse.

Also note that "no subtraction" x17 comes much worse, too much to claim that 17 is much better factor than others.

Note also that 'a' - ' ' == 'A'. As it looks like that most of your test samples contain small letter strings, what you were doing is most of the times just making them ALL CAPS strings and testing the *same* functions!

The only useful real conclusion I have up to now is that functions with multiplication factors are still the good choice for practical purposes.
Peter Kankowski, 2008/02/15 14:16
> too much to claim that 17 is much better factor than others
Yes, it's the combination of the factor 17 and subtracting that makes it the fastest function.

> functions with multiplication factors are still the good choice for practical purposes.
That was the main point of my first article :).

Ace, what do you think about Python dictionary code?
ace, 2008/02/15 15:49
>> too much to claim that 17 is much better factor than others
> it's the combination of the factor 17 and subtracting that makes it the fastest function.

But let me point again, I beleive that given two functions on strings where one processes

char - Const

and the other

char

can both be considered as one and the same function. It doesn't look to me that you proved anything, except that the function gives good results exactly for the set of strings you've selected. Wouldn't just "toggling case" of your set before the hashing result in loss exactly where now the gain is? Do you want to say that you've developed the hash function that does good on strings with mainly 'a'..'z' but not on strings with mainly 'A'..'Z'?

> what do you think about Python dictionary code?

I haven't invested much time, so I can't say much. If I correctly see, they have a special treatment for very small dictionaries -- that's very important and a very good thing to do. Then, if I've correctly understood, they don't use chaining, and they resize the table once it's more than 2/3 full. Now if the 2/3 is an optimal limit is something that depends of their collision resolution efficiency. In my opinion you've demonstrated that without chaining and with linear probing even 1/2 can be "too full".
Peter Kankowski, 2008/02/17 09:26
> Do you want to say that you've developed the hash function that does good on strings with mainly 'a'..'z' but not on strings with mainly 'A'..'Z'?
No, it will not be slower. Here are the raw results with upper-cased text files (open addressing, Pentium M):

               Words      Win32        Prefix       Postfix    Variables  Shakespeare
Bernstei     148[287]     882[919]     335[267]     319[231]     662[787]   864[1050]
K&R          144[214]     889[933]     334[229]     321[221]     652[655]   865[1029]
x17          139[197]     838[848]     309[192]     300[226]     642[939]   856[1219]
x17 unroll   133[197]     821[848]     302[192]     294[226]     626[939]   823[1219]
x65599       139[213]     860[1090     323[215]     311[222]     639[860]   839[1129]
FNV-1a       150[222]     956[1021     369[216]     357[222]     686[655]   903[1067]


Mixed case and lower case:

x17 unrolled      1.02
x17               1.05
x65599            1.06
lookup3           1.06
Paul Hsieh        1.07
Bernstein         1.09
K&R               1.10
FNV-1a            1.18
Arash Partow      1.20
universal         1.20
CRC-32            1.23
One At Time       1.25
Weinberger        1.45


After conversion to upper case:
x17 unrolled      1.01
x17               1.04
x65599            1.06
lookup3           1.06
Paul Hsieh        1.07
K&R               1.09
Bernstein         1.10
FNV-1a            1.17
universal         1.20
Arash Partow      1.20
CRC-32            1.22
One At Time       1.25
Weinberger        1.45


x17 should be slower on strings with a lot of "\r\n" and other control characters (though I have not checked this).
Peter Kankowski, 2008/06/17 07:56
Austin Appleby, the author of Murmur hash, replied me by e-mail and commented about Murmur having a larger number of collisions:

Actually, the number of collisions that Murmur (and lookup3, and any cryptographic hash) produces is in the range predicted by statistics if the hash is truly random. I don't recall the equation offhand, but your x17 does better than predicted because it's less random. :)

Statistically good distribution is important in some applications because it allows me to say "My hash function will not produce pathological results with your keyset" with some certainty even if I have no idea what your keyset is. Bob Jenkin's frog.c test is particularly good at producing pathologically bad keysets - it creates large keys that are mostly 0 bits but with a small handful of 1s - that choke most simple-but-fast hash functions. Murmur happens to hit a nice sweet spot where it is simple, fast, and still statistically strong.

Won, 2008/06/17 08:03
Peter, thanks for doing these tests. I am very surprised how poorly Adler32 does; an order of magnitude worse in Shakespeare than Fletcher is unexpected. Your code looks like it matches the wikipedia version, though.


UINT32 HashAdler(const CHAR * data, SIZE_T len)
{
UINT32 a = 1, b = 0;

while(len > 0) {
SIZE_T tlen = len > 5550 ? 5550 : len;
len -= tlen;
do {
a += *data++;
b += a;
} while (--tlen);

a %= 65521;
b %= 65521;
}
return (b << 16) | a;
}

ace, 2008/06/18 01:14
I really like MurmurHash.

But for me the shortest Murmur variant is a bit of cheating. The starts of the strings in C and C++ are *not* aligned to 4 bytes unless typical malloc or new is made for each string or each string is copied to the convenient buffer before the hash is calculated. So I'd consider MurmurHashAligned2 a real function and I'd do the timings on the set of the strings which are packed in memory one after another, without anything between them (which makes the start of each of them unaligned and not predictable). I'd also like to see the comparison between "copy to the nice buffer then do MurmurHash2" and "just do "MurmurHashAligned2" for strings never longer than e.g. 1024 bytes -- when something is longer it's certainly not probable to appear in unaligned input (this test would probably show the quality of L1 cache?).

------

The need for "statistically good distribution" is exactly the reason why "adding constant" to each source byte should not matter, the reason why I didn't like your claim about the goodness of "subtracting constant" in x17.

------

Austin Appleby mentions on

http://murmurhash.googlepages.com/discussion

that he did chi-square and avalanche testing.

Chi-square test are fundamental tests and should be used for most of experiments. In comp science it was of course used to check goodness of rnd generators probably since they exists.

One possible introduction:

http://www.fourmilab.ch/rpkp/experiments/statistics.html

and the program which uses the test to evaluate the quality of "random" stream:

http://www.fourmilab.ch/random/
Won, 2008/06/18 06:40
FWIW, glibc malloc has been 8-byte aligned for a while, and current versions might even be 16-byte aligned. tcmalloc is 8-byte aligned.

http://code.google.com/p/google-perftools/

In addition to tcmaloc, the perftools include some interesting hash table implementations. If you can deal with the limitations, intrusive data structures can be much more efficient.
Won, 2008/06/18 06:42
My mistake, the hash table implementations are here:

http://code.google.com/p/google-sparsehash/
Arash Partow, 2008/06/19 05:31
Hi Peter,

Nice go for a 4th time around, but I think we previously discussed as far as timing is concerned if the hash functions are not implemented in a similar fashion then its really not a fair or meaningful experiment. The hash functions from my site are implemented in their most basic definition, for production purposes you wouldn't use them as is you would try and do things like duff's device or something similar to what Jenkins does.

Take for example the way you have unrolled my hash function, its probably the worst way it could have been done, further more why not djb, pjw or others? So as far as timing is concerned unless you can get them all on the same footing it is worthless/meaningless to mention times.

The next issue is collisions, I still don't accept your methods as being generally acceptable though it does seem valid for most situations, as I suggest previously it may be better to get together on this an define a real set of tests, standard inputs and testing methodologies.

In anycase keep up the good work.
Peter Kankowski, 2008/06/19 06:47
AC, thanks for the link to fourmilab.ch; his explanations are so clear.

x86 processor can access non-aligned dwords. My previous measurements showed that it's faster to read short strings without alignment, so that you can avoid an additional switch statement before the main loop. Alignment matters only if you are going to use the function on a different processor or to hash some very long strings.

Subtracting a constant matters because you then calculate a modulus of the hash table size, so you effectively throw off some high bits. You can save more information in low bits by subtracting ' ' (if you know that '\n', '\r', and other control characters will not appear in the hashed strings).

Certainly, x17 has nothing common with a statistically good hash function; it just tries to "pack" more characters into a small hash value.
 

Won, thank you, I'm studying SparseHash now. They used quadratic probing instead of separate chaining. It would be interesting to compare these approaches (I will probably do this in future).

The first problem with Adler and Fletcher is that sum2 will be masked away when calculating the modulus. I tried replacing the last line with the following:


return a ^ b;


and got much better results:

                   Words       Win32       Numbers     Variables   Shakespeare
Adler-32 original  151 [178]   830 [1070]  382 [7688]  725 [1564]  1540 [11686]
Adler-32 (a ^ b)   147 [157]   791  [559]  251 [3624]  655  [719]  1141  [3567]


The second problem is that the characters are not "weighted" (multiplied by different numbers), so that Adler-32("01") = Adler-32("10"), that's why it fails the Numbers test. Ditto for anagrams in Shakespeare's sonnets: Adler-32("heart") = Adler-32("earth").

So, Adler-32 may be a good checksum for compressed data, but I would not use it for hash tables. Murmur is definitely better :)
ace, 2008/06/19 16:13
> so you effectively throw off some high bits

No, you just shuffle the values of each input byte (1 becomes 225 etc).

> Certainly, x17 has nothing common with a statistically good hash function; it just tries to "pack" more characters into a small hash value.

Yes, your tests clearly demonstrate that even using simple K&R and separate chaining can often be good enough. Maybe Google searches for a stronger function only because they decided not to use separate chaining?

I haven't analyzed their code, but it can be that they didn't really have to avoid separate chaining -- that they were able to keep the same memory usage, or even have better behaviour in cases when the number of input elements is not known in advance.

I'd still enjoy to see any real life example which demonstrates the real need for statistically strong hash function, provided open addressing is avoided.
Peter Kankowski, 2008/06/20 02:02
Speaking about quadratic probing vs. separate chaining, Google gives the following link in their sources:

http://www.augustana.ca/~mohrj/courses/1999.fall/csc210/lecture_notes/hashing.html
Won, 2008/06/20 06:23
google-sparsehash also contains a dense hash map which is quite fast. Although, as I mentioned earlier, intrusive implementations are much faster (if you can live with the limitations).

http://www.boost.org/doc/libs/1_35_0/doc/html/intrusive/unordered_set_unordered_multiset.html

Boost, of course, has lots of interesting stuff.

I mentioned Cuckoo hashing as an alternative chaining method to Peter in an e-mail. There are variants that can sustain very high load factors.

http://en.wikipedia.org/wiki/Cuckoo_hashing

For Adler/Fletcher: ah, that makes sense. I suppose the prime mod versions would work much better. But, I don't think it is correct to say that Adler/Fletcher cannot distinguish between permutations (think about how 'b' accumulates). I also don't think a^b is a particularly good mix for Adler/Fletcher. There is the problem with how you compute the bucket from the hash. If you need N bits, maybe it makes sense to combine N/2 bits for a and N/2 bits from B. Maybe this is as simple as taking the middle N bits from an Adler/Fletcher hash, rather than the least N bits.
Arash Partow, 2008/06/20 06:55
There is a nice write by mitzenmacher that breaks down cuckoo hashing into nothing more than intriguing theory. It is virtually impractical/impossible to obtain the optimal or even get even the optimal theoretical performance in practise.
Won, 2008/06/20 07:45
My reading of Mitzenmacher (his blogs, I will read the paper) is mostly positive. Certainly there is always going to be a difference between theory and practice, but that doesn't mean there isn't a practical variation of a cuckoo hashing that is useful.

One variation is to allow for multiple items in a table entry (similar to a multi-way cache) before rehashing those elements. This variation is certainly not unique to cuckoo hashing, but seems to work well with it.

Anyway, my discussion is mostly from memory, but these postings are encouraging me to look into it myself...
Peter Kankowski, 2008/06/20 07:59
Oh, I'm sorry, 'b' will be different for the anagrams, so the hash value will be different.

For short strings, the higher byte of 'a' is zero, so if we take the middle N bits, lower bits of 'a' will be lost. It makes sense to calculate a ^ (b << (N/2)).

I've just tried a ^ (b << 4) for these test files, and the results were comparable with Murmur:

                   Words       Win32       Numbers     Variables   Shakespeare
Adler-32 original  151 [178]   830 [1070]  382 [7688]  725 [1564]  1540 [11686]
Adler-32 (a ^ b)   147 [157]   791  [559]  251 [3624]  655  [719]  1141  [3567]
Adler-32 a^(b<<4)  146 [135]   789  [475]   98 [ 500]  627  [428]   923   [784]


Thank you very much for your ideas.
ace, 2008/06/20 15:09
> http://www.augustana.ca/~mohrj/courses/1999.fall/csc210/lecture_notes/hashing.html

AFAIK the tables there also clearly demonstrate the superiority of chaining? See what happens when load factor increases.

So I'd still like to see any good argument to use open addressing, especially since it demands more complex hash function only to be able to perform acceptably and then it even requires using this function more often, and behaves exponentially worse as load factor even approaches 1 (I'd call that lose-lose-exp(lose) scenario :) ).
Won, 2008/06/20 23:22
Ace, those are some good points.

However, there are some advantages to open addressing. A non-intrusive C++ hash table implementation that uses chaining (e.g. tr1/unordered_map, based on SGI STL hash_map) has to allocate nodes to store the data + the link pointers. This means memory allocations, copies, and indirections that contribute to the constant factor for chained hash table implementations. For typical load factors (1/2 to 2/3), open addressing (google-sparsehash uses quadratic probing) can be faster.

http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html

There is also the slight advantage that open addressing can be slightly more compact, but I don't know how important this actually is. Sustaining a high load factor is probably much more important than avoiding pointer overhead, so chaining might be even better in this regard. If you really care about compactness, something like Judy might be better:

http://nothings.org/computer/judy/

An intrusive hash table implementation avoids most of the practical problems of chaining. Since you avoid all those copies, allocations and indirections. These are much faster than any of the non-intrusive options, but they are not appropriate in every situation. It is not always possible to instrument classes to be used in such an application. The instrumentation overhead affects all instances, not just those in hash tables. Ownership semantics are very different, and that can lead to design subtleties (amplified by concurrency).
ace, 2008/06/22 13:05
Won, when you refer to "intrusive implementations" do you talk about the concept where the item being owned by the container contains a placeholder for the pointer to be used by the container? If so, of course the same method can be used in chaining. So I still fail to see the advantage of open addressing?

Moreover how can the load factor which so influences open addressing be kept constant without introducing even bigger performance penalties, when even slight changes in load factors force rebuild of the whole container?

I agree with you that ownership is very important. Personally I don't think it's smart overusing smart pointers in C++ and that when somebody doesn't want to care about ownership he should not use C++ at all but some language which has "natural" GC. For C and C++, I think the best results are achieved by maintaining a distinction between owning and referring containers. The compiler writers knew about that since forever I guess. And they certainly had to maintain a lot of hashes.

Is it possible for you to explain (if you know) what are the exact design decisions behind "sparse hash" since I fail to grasp them from the site? Only then it can be discussed if there's a possibility for improvement, or what the major contribution of that implementation is.

Thanks in advance.
Won, 2008/06/23 04:17
I only know as much about sparse-hash that is publicly known, which is essentially everything since it is open source! But no, I don't know the story of its creation.

Open addressing does not require additional information to handle collisions, because the "chaining" is implicit, based on the probing strategy. Objects are copied into the container without need for modification or ancillary data (maybe a little bit to indicate empty/full).

Explicit chaining requires some kind of pointer. Each bucket forms a short linked-list of elements. You can implement that linked-list in several ways. The non-intrusive way (like SGI hash_map) makes a node that is essentially a pair of the object + a pointer to the next node. The intrusive way (like Boost intrusive_unordered_map) is to have a special field within the object itself to point to the next object. So for STL-compatible containers, this isn't necessarily a huge win, since they are value-oriented. However, a non-owning, pointer-oriented intrusive hash map can be very fast because it avoids almost all copies and allocations.

BTW, ownership is always a design problem. GC does not automatically solve it for you.
ace, 2008/06/24 17:02
I've finally read carefully the sparsehash docs to understand what the goal of the author was. He wanted to make a replacement for SGI STL hash_set which doesn't introduce too big memory overhead for the hash table infrastructure.

Specifically, it seems that his problem was that when STL hash_set is used to *own* values (of some big structures?) a lot of memory was taken by placeholders for unpopulated entries.

That's why he designed the "sparsetable" which spends (more-or-less) a single bit to mark an used or unused entry in it, and allocates space to hold only the assigned values it owns. Then he used that "sparsetable" as the underlying table for the open-addressing hash table implementation.

So his goal was to fit as much as possible 'indexed' (with a hash function) structures in the memory without even spending a single pointer per structure kept, giving away performance to get the lower memory footprint.

Nice.
Enter your comment (wiki syntax is allowed):
OFGMY
 
hash_functions.txt · Last modified: 2009/05/04 09:09 (external edit)
 
Recent changes RSS feed Creative Commons License Driven by DokuWiki