Hash tables are popular data structures for storing key-value pairs. A hash function is used to map the key (usually a string) to an array index. These functions differ from cryptographic hash functions: they must be much faster, and they need not resist preimage attacks. Hashing in large databases is also outside the scope of this article; the benchmark covers medium-size hash tables such as:
- symbol table in a parser,
- IP address table for filtering network traffic,
- the dictionary in a word counting program or a spellchecker.
There are two classes of functions used in hash tables:
- multiplicative hash functions, which are simple and fast, but have a high number of collisions;
- more complex functions, which have better quality, but take more time to calculate.
Hash table benchmarks usually include theoretical metrics such as the number of collisions or distribution uniformity (see, for example, the hash function comparison in the Red Dragon book). Naturally, the more complex functions produce a better distribution, so they win these benchmarks.
The question is whether using complex functions gives you a faster program. The complex functions require more operations per key, so they can be slower. Is the price of collisions high enough to justify the additional operations?
Multiplicative hash functions
Any multiplicative hash function is a special case of the following algorithm:
UINT HashMultiplicative(const CHAR *key, SIZE_T len) {
    UINT hash = INITIAL_VALUE;
    for (UINT i = 0; i < len; ++i)
        hash = M * hash + key[i];
    return hash % TABLE_SIZE;
}
(Sometimes an XOR operation is used instead of addition, but it does not make much difference.) The hash functions differ only in the values of INITIAL_VALUE and the multiplier M. For example, the popular Bernstein function uses an INITIAL_VALUE of 5381 and an M of 33; Kernighan and Ritchie's function uses an INITIAL_VALUE of 0 and an M of 31.
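For concreteness, here is the template above instantiated with Bernstein's constants (a sketch; UINT, CHAR, SIZE_T, and TABLE_SIZE are as in the code above):

UINT HashBernstein(const CHAR *key, SIZE_T len) {
    UINT hash = 5381;                  /* Bernstein's INITIAL_VALUE */
    for (SIZE_T i = 0; i < len; ++i)
        hash = 33 * hash + key[i];     /* M = 33 */
    return hash % TABLE_SIZE;
}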
A multiplicative function works by adding together the letters weighted by powers of the multiplier. For example, the hash for the word TONE will be:
INITIAL_VALUE * M^4 + 'T' * M^3 + 'O' * M^2 + 'N' * M + 'E'
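For instance, with Kernighan and Ritchie's constants (INITIAL_VALUE = 0, M = 31) and ASCII codes 'T' = 84, 'O' = 79, 'N' = 78, 'E' = 69, this evaluates to 84 * 31^3 + 79 * 31^2 + 78 * 31 + 69 = 2,580,850 before the final reduction modulo TABLE_SIZE.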
Let's enter several similar strings and watch the output of the functions:
String | Bernstein (M = 33) | Kernighan (M = 31)
---|---|---
too | b88af17 | 1c154
top | b88af18 | 1c155
tor | b88af1a | 1c157
tpp | b88af39 | 1c174
a000 | 7c9312d6 | 2cd22f
a001 | 7c9312d7 | 2cd230
a002 | 7c9312d8 | 2cd231
a003 | 7c9312d9 | 2cd232
a004 | 7c9312da | 2cd233
a005 | 7c9312db | 2cd234
a006 | 7c9312dc | 2cd235
a007 | 7c9312dd | 2cd236
a008 | 7c9312de | 2cd237
a009 | 7c9312df | 2cd238
a010 | 7c9312f7 | 2cd24e
a | 2b606 | 61
aa | 597727 | c20
aaa | b885c68 | 17841
Too and top differ only in the last letter. The letter P comes right after O, so the hash values differ by 1 (1c154 vs. 1c155, b88af17 vs. b88af18). Ditto for a000..a009.
Now let's compare top with tpp. Their hashes will be:
INITIAL_VALUE * M^3 + 'T' * M^2 + 'O' * M + 'P'
INITIAL_VALUE * M^3 + 'T' * M^2 + 'P' * M + 'P'
The hashes will differ by M * ('P' - 'O') = M. Similarly, when the first letters differ by x, the hashes will differ by x * M^2.
When there are fewer than 33 possible letters, Bernstein's function will pack them into a number (similar to the Radix40 packing scheme). For example, a hash table of size 33^3 = 35,937 will provide perfect hashing (without any collisions) for all three-letter English words written in lowercase. In practice, the words are longer and hash tables are smaller, so there will be some collisions (situations where different strings have the same hash value).
If the string is too long to fit into the 32-bit number, the first letters will still affect the value of the hash function, because the multiplication is done modulo 2^32 (in a 32-bit register), and the multiplier is chosen to have no common divisors with 2^32 (in other words, it must be odd), so the bits will not simply be shifted out.
There are no exact rules for choosing the multiplier, only some heuristics:
- the multiplier should be large enough to accommodate most of the possible letters (e.g., 3 or 5 is too small);
- the multiplier should be fast to calculate with shifts and additions [e.g., 33 * hash can be calculated as (hash << 5) + hash; see the sketch after this list];
- the multiplier should be odd for the reason explained above;
- prime numbers are good multipliers.
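To illustrate the shift-and-add heuristic from the list above, here is the strength-reduced form of the multiplication by 33 (a sketch; compilers usually perform this rewrite automatically for constant multipliers):

hash = (hash << 5) + hash + key[i];   /* same as 33 * hash + key[i] */

With M = 17, the equivalent form is (hash << 4) + hash.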
Complex hash functions
These functions do a good job of mixing together the bits of the source word. A change in one input bit changes half of the bits in the output (see the avalanche effect), so the result looks completely random:
String | Paul Hsieh | One At Time
---|---|---
too | 3ad11d33 | 3a9fad1e
top | 78b5a877 | 4c5dd09a
tor | c09e2021 | f2aa9d35
tpp | 3058996d | d5e9e480
a000 | 7552599f | ed3859d8
a001 | 3cc1d896 | fef7fd57
a002 | c6ff5c9b | 08a610b3
a003 | dcab7b0c | 1a88b478
a004 | 780c7202 | 3621ebaa
a005 | 7eb63e3a | 47db8f1d
a006 | 6b0a7a17 | b901717b
a007 | cb5cb1ab | caec1550
a008 | 5c2a15c0 | e58d4a92
a009 | 33339829 | f75aee2d
a010 | eb1f336e | bd097a6b
a | 115ea782 | ca2e9442
aa | 008ad357 | 7081738e
aaa | 7dfdc310 | ae4f22ec
To achieve this behavior, the hash functions perform a lot of shifts, XORs, and additions. But do we need a complex function? What is faster: tolerating the collisions and resolving them with chaining, or avoiding them with a more complex function?
Test conditions
The benchmark uses the separate chaining algorithm for collision resolution. Memory allocation and other "heavy" functions were excluded from the benchmarked code. The RDTSC instruction was used for timing. The tests were performed on Pentium-M and Core i5 processors.
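For reference, reading the cycle counter with MSVC can be as simple as this (an illustrative sketch, not necessarily the exact harness used here):

#include <intrin.h>

unsigned __int64 start = __rdtsc();
/* ... code under test ... */
unsigned __int64 cycles = __rdtsc() - start;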
The benchmark inserts some keys in the table, then looks them up in the same order as they were inserted. The test data include:
- the list of common words from Wiktionary (500 items);
- the list of Win32 functions from Colorer syntax highlight scheme (1992 items);
- 500 names from a000 to a499 (imitates the names in auto-generated source code);
- the list of common words with a long prefix and postfix;
- all variable names from WordPress 2.3.2 source code in wp-includes folder (1842 names);
- list of all words in Sonnets by W. Shakespeare (imitates a word counting program; 3228 words);
- list of all words in La Peau de chagrin by Balzac (in French, UTF-8 encoding);
- search engine IP addresses (binary).
Results
Core i5 processor
Function | Words | Win32 | Numbers | Prefix | Postfix | Variables | Sonnets | UTF-8 | IPv4 | Avg
---|---|---|---|---|---|---|---|---|---|---
iSCSI CRC | 65 [105] | 329 [415] | 36 [112] | 84 [106] | 83 [92] | 280 [368] | 408 [584] | 1964 [2388] | 322 [838] | 1.01 [1.78]
Meiyan | 64 [102] | 328 [409] | 45 [125] | 87 [106] | 85 [112] | 274 [350] | 411 [588] | 1972 [2377] | 353 [768] | 1.05 [1.87]
Murmur2 | 72 [103] | 378 [415] | 48 [104] | 109 [106] | 106 [111] | 315 [383] | 450 [566] | 2183 [2399] | 399 [834] | 1.21 [1.74]
XXHfast32 | 78 [110] | 372 [420] | 57 [102] | 88 [103] | 88 [106] | 315 [347] | 473 [491] | 2323 [2494] | 463 [838] | 1.23 [1.71]
SBox | 70 [91] | 389 [431] | 46 [116] | 124 [108] | 123 [91] | 304 [347] | 430 [526] | 2182 [2442] | 377 [836] | 1.23 [1.78]
Larson | 72 [99] | 401 [416] | 34 [16] | 143 [99] | 141 [105] | 312 [366] | 451 [583] | 2230 [2447] | 349 [755] | 1.25 [1.10]
XXHstrong32 | 79 [109] | 385 [429] | 58 [102] | 93 [102] | 92 [112] | 321 [355] | 474 [491] | 2332 [2496] | 464 [838] | 1.25 [1.72]
Sedgewick | 73 [107] | 417 [414] | 36 [48] | 143 [103] | 143 [103] | 319 [348] | 446 [570] | 2246 [2437] | 349 [782] | 1.26 [1.33]
Novak unrolled | 76 [113] | 404 [399] | 43 [90] | 127 [118] | 125 [113] | 322 [342] | 459 [581] | 2284 [2430] | 379 [969] | 1.26 [1.68]
CRC-32 | 70 [101] | 429 [426] | 40 [64] | 146 [107] | 143 [94] | 320 [338] | 443 [563] | 2231 [2400] | 357 [725] | 1.28 [1.41]
Murmur3 | 78 [101] | 391 [380] | 54 [104] | 108 [103] | 107 [105] | 331 [334] | 492 [555] | 2360 [2376] | 433 [783] | 1.28 [1.69]
x65599 | 74 [111] | 407 [382] | 45 [203] | 144 [107] | 144 [122] | 316 [379] | 449 [560] | 2221 [2373] | 349 [846] | 1.29 [2.45]
FNV-1a | 74 [124] | 408 [428] | 47 [108] | 144 [94] | 144 [105] | 309 [374] | 440 [555] | 2193 [2446] | 376 [807] | 1.30 [1.77]
Murmur2A | 79 [114] | 410 [433] | 53 [102] | 117 [112] | 114 [109] | 337 [365] | 494 [544] | 2377 [2369] | 429 [772] | 1.31 [1.73]
Fletcher | 71 [131] | 352 [406] | 80 [460] | 104 [127] | 100 [108] | 312 [507] | 481 [1052] | 2477 [4893] | 388 [1359] | 1.31 [4.62]
K&R | 73 [106] | 429 [437] | 47 [288] | 149 [94] | 149 [106] | 324 [360] | 450 [561] | 2266 [2365] | 343 [831] | 1.32 [3.00]
Paul Hsieh | 80 [114] | 410 [420] | 54 [118] | 123 [101] | 121 [100] | 336 [341] | 496 [600] | 2351 [2380] | 433 [847] | 1.33 [1.83]
Bernstein | 75 [114] | 428 [412] | 49 [288] | 150 [100] | 150 [102] | 324 [353] | 460 [572] | 2312 [2380] | 351 [703] | 1.34 [2.99]
x17 unrolled | 78 [109] | 446 [415] | 43 [24] | 156 [113] | 153 [102] | 344 [368] | 472 [589] | 2361 [2392] | 373 [829] | 1.37 [1.19]
lookup3 | 83 [101] | 459 [412] | 55 [97] | 140 [101] | 137 [95] | 359 [361] | 526 [550] | 2480 [2392] | 427 [834] | 1.42 [1.65]
MaPrime2c | 79 [103] | 459 [426] | 50 [106] | 155 [91] | 155 [106] | 349 [349] | 486 [550] | 2493 [2406] | 406 [865] | 1.42 [1.73]
Ramakrishna | 80 [108] | 513 [409] | 44 [91] | 189 [125] | 186 [103] | 370 [360] | 483 [528] | 2565 [2383] | 380 [840] | 1.51 [1.66]
One At Time | 85 [105] | 562 [421] | 58 [110] | 221 [97] | 220 [103] | 392 [364] | 511 [545] | 2659 [2346] | 459 [795] | 1.72 [1.75]
Arash Partow | 83 [101] | 560 [435] | 71 [420] | 215 [98] | 212 [85] | 392 [355] | 507 [570] | 2638 [2372] | 407 [779] | 1.72 [3.88]
Weinberger | 87 [104] | 590 [422] | 37 [100] | 254 [111] | 273 [117] | 398 [364] | 541 [712] | 2734 [2547] | 419 [744] | 1.78 [1.75]
Hanson | 73 [118] | 417 [649] | 45 [112] | 123 [118] | 1207 [499] | 318 [435] | 448 [592] | 2324 [2890] | 370 [833] | 2.70 [2.46]
Pentium-M processor
Function | Words | Win32 | Numbers | Prefix | Postfix | Variables | Sonnets | UTF-8 | IPv4 | Avg
---|---|---|---|---|---|---|---|---|---|---
Meiyan | 80 [102] | 426 [409] | 56 [125] | 123 [106] | 121 [112] | 354 [350] | 525 [588] | 2443 [2377] | 445 [768] | 1.02 [1.87]
Novak unrolled | 90 [113] | 517 [399] | 56 [90] | 169 [118] | 164 [113] | 398 [342] | 575 [581] | 2716 [2430] | 482 [969] | 1.18 [1.68]
Fletcher | 84 [131] | 444 [406] | 102 [460] | 140 [127] | 133 [108] | 374 [507] | 592 [1052] | 2891 [4893] | 513 [1359] | 1.21 [4.62]
SBox | 88 [91] | 552 [431] | 57 [116] | 181 [108] | 178 [91] | 414 [347] | 560 [526] | 2814 [2442] | 472 [836] | 1.22 [1.78]
Murmur2 | 97 [103] | 532 [415] | 65 [104] | 165 [106] | 162 [111] | 434 [383] | 622 [566] | 2948 [2399] | 537 [834] | 1.25 [1.74]
CRC-32 | 90 [101] | 565 [426] | 55 [64] | 198 [107] | 192 [94] | 427 [338] | 590 [563] | 2842 [2400] | 469 [725] | 1.26 [1.41]
x17 unrolled | 93 [109] | 593 [415] | 52 [24] | 214 [113] | 208 [102] | 434 [368] | 593 [589] | 2867 [2392] | 486 [829] | 1.30 [1.19]
lookup3 | 94 [101] | 565 [412] | 70 [97] | 189 [101] | 182 [95] | 432 [361] | 631 [550] | 2943 [2392] | 572 [834] | 1.32 [1.65]
K&R | 93 [106] | 619 [437] | 58 [288] | 221 [94] | 218 [106] | 442 [360] | 587 [561] | 2961 [2365] | 447 [831] | 1.33 [3.00]
Larson | 95 [99] | 631 [416] | 49 [16] | 231 [99] | 228 [105] | 455 [366] | 599 [583] | 3027 [2447] | 469 [755] | 1.35 [1.10]
XXHfast32 | 108 [110] | 546 [420] | 86 [102] | 139 [103] | 136 [106] | 459 [347] | 681 [491] | 3259 [2494] | 717 [838] | 1.35 [1.71]
Murmur3 | 108 [101] | 561 [380] | 74 [104] | 167 [103] | 165 [105] | 468 [334] | 700 [555] | 3259 [2376] | 604 [783] | 1.36 [1.69]
Bernstein | 97 [114] | 622 [412] | 61 [288] | 225 [100] | 222 [102] | 448 [353] | 609 [572] | 3053 [2380] | 469 [703] | 1.37 [2.99]
XXHstrong32 | 108 [109] | 558 [429] | 86 [102] | 150 [102] | 147 [112] | 460 [355] | 682 [491] | 3262 [2496] | 714 [838] | 1.38 [1.72]
x65599 | 99 [111] | 628 [382] | 61 [203] | 234 [107] | 232 [122] | 459 [379] | 630 [560] | 3097 [2373] | 471 [846] | 1.40 [2.45]
Paul Hsieh | 106 [114] | 576 [420] | 82 [118] | 183 [101] | 178 [100] | 456 [341] | 678 [600] | 3154 [2380] | 670 [847] | 1.41 [1.83]
Sedgewick | 101 [107] | 667 [414] | 52 [48] | 245 [103] | 242 [103] | 478 [348] | 630 [570] | 3204 [2437] | 475 [782] | 1.42 [1.33]
Murmur2A | 113 [114] | 598 [433] | 78 [102] | 183 [112] | 178 [109] | 488 [365] | 719 [544] | 3380 [2369] | 651 [772] | 1.44 [1.73]
FNV-1a | 102 [124] | 660 [428] | 62 [108] | 239 [94] | 237 [105] | 473 [374] | 627 [555] | 3140 [2446] | 516 [807] | 1.44 [1.77]
MaPrime2c | 108 [103] | 705 [426] | 65 [106] | 255 [91] | 254 [106] | 508 [349] | 674 [550] | 3413 [2406] | 542 [865] | 1.54 [1.73]
Ramakrishna | 108 [108] | 728 [409] | 61 [91] | 278 [125] | 272 [103] | 511 [360] | 660 [528] | 3378 [2383] | 517 [840] | 1.56 [1.66]
Arash Partow | 106 [101] | 739 [435] | 93 [420] | 280 [98] | 275 [85] | 514 [355] | 671 [570] | 3332 [2372] | 543 [779] | 1.65 [3.88]
One At Time | 118 [105] | 830 [421] | 81 [110] | 321 [97] | 319 [103] | 578 [364] | 741 [545] | 3809 [2346] | 657 [795] | 1.82 [1.75]
Weinberger | 119 [104] | 956 [422] | 54 [100] | 375 [111] | 379 [117] | 614 [364] | 745 [712] | 3973 [2547] | 560 [744] | 1.89 [1.75]
Hanson | 86 [118] | 531 [649] | 55 [112] | 168 [118] | 1722 [499] | 393 [435] | 549 [592] | 2742 [2890] | 463 [833] | 2.60 [2.46]
Each cell shows the execution time, then the number of collisions in square brackets. Execution time is expressed in thousands of clock cycles (lower is better). The Avg column contains the average normalized execution time (and, in brackets, the normalized number of collisions).
The function by Kernighan and Ritchie is from their famous book "The C Programming Language", second edition; Weinberger's hash and the hash with multiplier 65599 are from the Red Dragon book. The latter function is used in gawk, sdbm, and other Linux programs. x17 is the function by Peter Kankowski (multiplier = 17; 32 is subtracted from each letter code).
As you can see from the table, the function with the lowest number of collisions is not always the fastest one.
Results on a large data set (list of all words in English Wikipedia, 12.5 million words, from the benchmark by Georgi 'Sanmayce'):
Core i5 processor
Function | Wikipedia | Avg
---|---|---
iSCSI CRC | 5725944 [2077725] | 1.00 [1.00]
Meiyan | 5829105 [2111271] | 1.02 [1.02]
Murmur2 | 6313466 [2081476] | 1.10 [1.00]
Larson | 6403975 [2080111] | 1.12 [1.00]
Murmur3 | 6492620 [2082084] | 1.13 [1.00]
x65599 | 6479417 [2102893] | 1.13 [1.01]
FNV-1a | 6599423 [2081195] | 1.15 [1.00]
SBox | 6964673 [2084018] | 1.22 [1.00]
Hanson | 7007689 [2129832] | 1.22 [1.03]
CRC-32 | 7016147 [2075088] | 1.23 [1.00]
Sedgewick | 7060691 [2080640] | 1.23 [1.00]
XXHfast32 | 7078804 [2084164] | 1.24 [1.00]
K&R | 7109841 [2083145] | 1.24 [1.00]
XXHstrong32 | 7168788 [2084514] | 1.25 [1.00]
Bernstein | 7247096 [2074237] | 1.27 [1.00]
lookup3 | 7342986 [2084889] | 1.28 [1.01]
Murmur2A | 7376650 [2081370] | 1.29 [1.00]
Paul Hsieh | 7387317 [2180206] | 1.29 [1.05]
x17 unrolled | 7410443 [2410605] | 1.29 [1.16]
Ramakrishna | 8172670 [2093253] | 1.43 [1.01]
One At Time | 8338799 [2087861] | 1.46 [1.01]
MaPrime2c | 8428492 [2084467] | 1.47 [1.00]
Arash Partow | 8503299 [2084572] | 1.49 [1.00]
Weinberger | 9416340 [3541181] | 1.64 [1.71]
Novak unrolled | 21289919 [6318611] | 3.72 [3.05]
Fletcher | 22235133 [9063797] | 3.88 [4.37]
Pentium-M processor
Function | Wikipedia | Avg
---|---|---
x17 unrolled | 11321744 [2410605] | 1.00 [1.16]
K&R | 11666050 [2083145] | 1.03 [1.00]
Bernstein | 11833902 [2074237] | 1.05 [1.00]
Larson | 11888751 [2080111] | 1.05 [1.00]
Sedgewick | 12111839 [2080640] | 1.07 [1.00]
x65599 | 12144777 [2102893] | 1.07 [1.01]
Arash Partow | 12235396 [2084572] | 1.08 [1.00]
Ramakrishna | 12185834 [2093253] | 1.08 [1.01]
Meiyan | 12269691 [2111271] | 1.08 [1.02]
CRC-32 | 12604152 [2075088] | 1.11 [1.00]
Murmur2 | 12713455 [2081476] | 1.12 [1.00]
SBox | 12716574 [2084018] | 1.12 [1.00]
Hanson | 12627597 [2129832] | 1.12 [1.03]
lookup3 | 12791917 [2084889] | 1.13 [1.01]
FNV-1a | 12868991 [2081195] | 1.14 [1.00]
Murmur3 | 12916960 [2082084] | 1.14 [1.00]
XXHfast32 | 12936106 [2084164] | 1.14 [1.00]
XXHstrong32 | 12950650 [2084514] | 1.14 [1.00]
Murmur2A | 13068746 [2081370] | 1.15 [1.00]
Paul Hsieh | 12992315 [2180206] | 1.15 [1.05]
MaPrime2c | 13348580 [2084467] | 1.18 [1.00]
One At Time | 13662010 [2087861] | 1.21 [1.01]
Weinberger | 14592843 [3541181] | 1.29 [1.71]
Fletcher | 37410790 [9063797] | 3.30 [4.37]
Novak unrolled | 37769882 [6318611] | 3.34 [3.05]
Some functions were excluded from the benchmark because of very bad performance:
- Adler-32 (slow filling, not suitable as a hash function);
- TwoChars (bad for machine-generated names and variable names that are similar to each other, disastrous for large data sets such as Wikipedia).
(Chart: the number of collisions depending on the hash table size, for the same data set; thanks to Ace for the idea.)
The Red Dragon Book proposes the following formula for evaluating hash function quality:
( Σ_{j=0..m−1} b_j (b_j + 1) / 2 ) / ( (n / 2m) (n + 2m − 1) )
where b_j is the number of items in the j-th slot, m is the number of slots, and n is the total number of items. The sum of b_j (b_j + 1) / 2 estimates the number of slots your program has to visit to find the required value. The denominator, (n / 2m)(n + 2m − 1), is the number of visited slots for an ideal function that puts each item into a random slot. So, if the function is ideal, the formula gives 1. In practice, a good function is somewhere between 0.95 and 1.05. If it's much higher, there is a high number of collisions (slow!). If it's lower, the function gives fewer collisions than a random distribution would, which is not bad.
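A direct computation of this ratio from the chain lengths might look like this (a sketch; slots[j] holds b_j for each of the m slots):

double HashQuality(const UINT *slots, SIZE_T m, SIZE_T n) {
    double visited = 0.0;
    for (SIZE_T j = 0; j < m; ++j)
        visited += slots[j] * (slots[j] + 1.0) / 2.0;   /* sum of b_j (b_j + 1) / 2 */
    double ideal = ((double)n / (2.0 * m)) * (n + 2.0 * m - 1.0);
    return visited / ideal;   /* about 1.0 for a random-like function */
}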
Here are the results for some of our functions:
Conclusion
Complex functions by Paul Hsieh and Bob Jenkins are tuned for long keys, such as the ones in the Prefix and Postfix tests. Note that they do not provide the best number of collisions for these tests, but they do have the best time, which means they are faster than the others because of loop unrolling, not because of fewer collisions. At the same time, they are suboptimal for short keys (the Words and Sonnets tests).
For a word counting program, a compiler, or another application that typically handles short keys, it's often advantageous to use a simple multiplicative function such as x17 or Larson's hash. However, these functions perform badly on long keys.
Novak showed bad results on the large data set. Jesteress has a high number of collisions in the Numbers test.
Murmur2, Meiyan, SBox, and CRC32 provide good performance for all kinds of keys. They can be recommended as general-purpose hashing functions on x86.
Hardware-accelerated CRC (labeled iSCSI CRC in the table) is the fastest hash function on the recent Core i5/i7 processors. However, the CRC32 instruction is not supported by AMD and earlier Intel processors.
Download the source code (152 KB, MSVC++)
Variations
XORing high and low part
For table sizes less than 2^16, we can improve the quality of the hash function by XORing the high and low words, so that more letters are taken into account:
return hash ^ (hash >> 16);
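In context, the fold replaces the final reduction, e.g. (a sketch, assuming TABLE_SIZE is a power of two below 2^16):

hash ^= hash >> 16;
return hash & (TABLE_SIZE - 1);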
Subtracting a constant
The x17 hash function subtracts a space (0x20) from each letter to cut off the control characters in the range 0x00..0x1F. If the hash keys are long and contain only Latin letters and numbers, the letters will be shifted out less frequently, and the overall number of collisions will be lower. You can even subtract 'A' when you know that the keys will be only English words.
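Putting the pieces together, x17 might look like this (a sketch assembled from the description above; the initial value is an assumption, and the version in the benchmark is unrolled):

UINT HashX17(const CHAR *key, SIZE_T len) {
    UINT hash = 0;  /* initial value: an assumption in this sketch */
    for (SIZE_T i = 0; i < len; ++i)
        hash = 17 * hash + (key[i] - ' ');  /* subtract 0x20 from each letter */
    return hash ^ (hash >> 16);
}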
Using larger multipliers for a compiler
Paul Hsieh noted that large multipliers may provide better results for the hash table in a compiler, because typical source code contains a lot of one-letter variable names (i, j, s, etc.), and they will collide if the multiplier is less than the number of letters in the alphabet.
The test confirms this assumption: the function by Kernighan and Ritchie (M = 31) has a lower number of collisions than x17 (M = 17), but the latter is still faster (see the Variables column in the table above).
Setting hash table size to a prime number
A test showed that the number of collisions will usually be lower if you use a prime, but calculations modulo a prime take much more time than calculations for a power of 2, so this method is impractical. Even replacing the division with multiplication by reciprocal values does not help here:
Function | Words | Win32 | Numbers | Prefix | Postfix | Variables | Shakespeare
---|---|---|---|---|---|---|---
Bernstein % 2^K | 145 [261] | 880 [889] | 426 [8030] | 326 [214] | 316 [226] | 649 [697] | 874 [1131]
Bernstein % prime | 186 [221] | 1049 [995] | 445 [5621] | 364 [194] | 357 [217] | 805 [800] | 1123 [1051]
Bernstein optimized mod | 160 [221] | 960 [995] | 416 [5621] | 341 [194] | 334 [217] | 722 [800] | 969 [1051]
x17 % 2^K | 137 [193] | 847 [1002] | 81 [340] | 314 [244] | 300 [228] | 641 [863] | 832 [1012]
x17 % prime | 173 [256] | 1010 [1026] | 104 [324] | 356 [246] | 339 [216] | 760 [760] | 1046 [1064]
x17 optimized mod | 155 [256] | 915 [1026] | 96 [324] | 330 [246] | 315 [216] | 691 [760] | 930 [1064]
Implementing open addressing vs. separate chaining
With open addressing, most hash functions show awkward clustering behavior in the Numbers test:
Method | Bernst. | K&R | x17 unroll | x65599 | FNV | Univ | Weinb. | Hsieh | One-at | Lookup3 | Partow | CRC
---|---|---|---|---|---|---|---|---|---|---|---|---
OA | 426 [8030] | 81 [20810] | 84 [340] | 207 [3158] | 88 [207] | 91 [480] | 273 [4360] | 110 [342] | 103 [267] | 92 [205] | 1042 [20860] | 79 [96]
32-bit | 179 [8030] | 69 [20810] | 74 [340] | 114 [3158] | 86 [207] | 80 [480] | 125 [4360] | 105 [342] | 99 [267] | 92 [205] | 347 [20860] | 82 [96]
chain | 92 [500] | 68 [500] | 73 [24] | 82 [258] | 88 [124] | 84 [48] | 73 [100] | 107 [138] | 99 [131] | 95 [108] | 149 [1530] | 84 [64]
You can avoid the worst case by using chaining for collision resolution. However, chaining requires more memory for the next-item pointers, so the performance improvement does not come for free. A custom memory allocator should usually be written, because calling malloc() for a large number of small structures is suboptimal.
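For illustration, a pool allocator for chain nodes can be as simple as this (a sketch; the node layout and the MAX_NODES capacity are assumptions, not the benchmark's actual allocator):

typedef struct Node {
    struct Node *next;   /* collision chain */
    const CHAR  *key;
    UINT         value;
} Node;

enum { MAX_NODES = 4096 };        /* assumed capacity */
static Node   pool[MAX_NODES];
static SIZE_T pool_used = 0;

static Node *AllocNode(void) {
    /* one pointer bump instead of a malloc() call per node */
    return pool_used < MAX_NODES ? &pool[pool_used++] : NULL;
}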
Some implementations (e.g., the hash table in the Python interpreter) store a full 32-bit hash with the item to speed up the string comparison, but this is less effective than chaining.
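The stored-hash trick looks roughly like this (a sketch; the slot layout is an assumption):

#include <string.h>

typedef struct {
    UINT        hash;   /* full 32-bit hash kept with the item */
    const CHAR *key;
} Slot;

/* compare the cheap integers first; strcmp runs only when the hashes match */
static int SlotMatches(const Slot *slot, UINT h, const CHAR *key) {
    return slot->hash == h && strcmp(slot->key, key) == 0;
}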
Comments
Have you tried to compare the hash functions against CRC32? That would be interesting!
Some DSPs can already do Galois multiplies, which is the slow part of CRC. For PCs we'll have cheap CRC in the future when the new SSE becomes mainstream (hopefully).
Cool blog, btw.
Most of the functions presented here produce some level of entropy across 32 bits. What you are doing by mod'ing by 1024 is ignoring the 22 higher bits. You should integrate them back into the result somehow.
A true test would not quantize the values. Instead, it would just make a list of already generated values and see if those values occur again within the period, the period being the size of the "common words" set, etc.
Another good test is to see the avalanching abilities of the functions. In this case you change only 1 bit in the input and then see how many of the output bits change. The average should be close to half the number of output bits (e.g., 16 bits in the case of 32-bit outputs).
Another good thing to remember is to use random values rather than only English text, as in English only certain bits in a byte are likely to change; how many times do we use control characters in English words?
I believe if you follow the above you may see different results...
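A minimal version of the avalanche check described above might look like this (an illustrative sketch in the benchmark's C style; hash_fn stands for whichever function is under test):

static int PopCount(UINT x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }   /* clear the lowest set bit */
    return n;
}

/* Flip each input bit in turn; return the average number of output bits that change. */
double Avalanche(UINT (*hash_fn)(const CHAR *, SIZE_T), CHAR *buf, SIZE_T len) {
    UINT base = hash_fn(buf, len);
    double total = 0.0;
    for (SIZE_T i = 0; i < len * 8; ++i) {
        buf[i / 8] ^= (CHAR)(1 << (i % 8));            /* flip one bit  */
        total += PopCount(base ^ hash_fn(buf, len));   /* count changes */
        buf[i / 8] ^= (CHAR)(1 << (i % 8));            /* restore       */
    }
    return total / (len * 8);   /* ideally about 16 for a 32-bit hash */
}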
My purpose was to benchmark the hash functions in a close-to-real-life scenario. Theoretical tests will give a completely different result.
For example, cryptographic hashes such as MD5 will give a perfect distribution, but they are impractically slow for hash tables. Good results in a theoretical test do not always mean a fast hash function.
So, even if the K&R idea is close to the reinvented "simple steps for each character, complex step at the end", the function they published is still not good enough for your "numbers" test, but otherwise it is not so bad. And using primes is not such a good idea.
But if I remember correctly, the "K&R" code you use is still not the "from the book" one, as they don't XOR the higher and lower bits?
And also, if I remember the K&R book properly, they use what the Wikipedia article about "hash table" calls "separate chaining". Would you compare how everything behaves then? Of course, for good results, you shouldn't allocate each new node with a separate malloc call.
For the K&R function, the version without XOR seems to be better, so I will use it in future tests. The FNV authors recommend using XOR, though their function often works better without it. XORing also does not help complex functions, because they already mix the bits.
The choice between separate chaining and open addressing depends on your task. If you need an add-only hash table (a symbol table in a compiler, a word counting program, etc.), open addressing will be faster because of better data locality for caching.
If you need to delete items from the table, separate chaining may be faster. Or maybe not: note that in the Python dictionary, they use some complex variation of open addressing, not separate chaining (see the Beautiful Code book). I will not do the tests, because the results will depend on the usage pattern of a particular application (how often does it delete items? which items? how many times is the table resized? etc.). You should benchmark it on some specific task.
I have always used separate chaining for my hashes. Unless you know at the start how big a hash table you need, separate chaining behaves better.
As expected, x17 is noticeably faster only in your "Numbers" test set. Otherwise the numbers of collisions are very similar. Also note that the pathological behaviour is fully avoided.
Could you please share your code via an upload site (such as RapidShare.de or Zalil.ru)?
Now, your code depends a lot on knowing exactly how many strings it has to process before it determines the table size. And if we use more elements for the same table size my program will run circles around yours.
I've checked now the wikipedia entry I've mentioned (the first time I've just looked up how they call "chaining" and didn't read anything) and I can just tell you -- don't believe them when they write:
"in most cases the differences between these algorithms is marginal, and other considerations typically come into play." ( false )
It looks to me that the author of wikipedia article "invented" that conclusion (that's against the wikipedia policy, but wikipedia shouldn't be used as authoritative reference anyway).
It seems that your statements "The choice between separate chaining and open addressing depends on your task. If you need add-only hash table (a symbols table in compiler, a word counting program, etc.), open addressing will be faster because of better data locality for caching" are also based on that wikipedia article and not on your actual or thought experiments.
Properly implemented, chaining is always much faster in "harder" cases and has comparable performance otherwise. Using it is more important than trying to be smart with the hash function. And of course, in practice there's seldom any benefit in using anything more complicated than the K&R-style function; in that aspect you were on the right track. Only for very long strings can there be a benefit in processing more bytes at once during hash index calculation. In any real use, the time spent in hash index calculation is not what makes a program slow; not using chains often is (and of course, using mallocs where they are not needed always slows things down).
I'm sorry that I can't upload the code, but it's easy to repeat the results. I've counted the "collisions" any time the string comparison to equality failed.
Separate chaining needs more memory because of the next item pointers (usually 3 times more than open addressing with the same table size), so people reduce the size of the table to save space. Sedgewick recommends setting table size to 1/5..1/10 of the number of items (on average, 5..10 items in each chain). For open addressing, he uses table size = 2..4 * N.
If you make a sparser table, the search will be faster (a classical time-space compromise), and it's true for both methods. Open addressing would also be faster if you take a larger table. I would use not 1/5..1/10, but somewhere around 1/2..1 for separate chaining. Memory consumption should be the same, so it will be fair.
I've counted the "collisions" any time the string comparison to equality failed.
You probably counted them twice (when inserting into the table and when searching in it). That's why your figures for collisions should not be compared with mine.
...based on that wikipedia article and not on your actual or thought experiments
My statements are based on an understanding of computer architecture and caching. Open addressing requires fewer memory accesses. If the string comparison fails, it reads the next item, usually from the same cache line. Separate chaining reads from non-adjacent cache lines.
There is another way to reduce the number of memory accesses (used in the Python dictionary implementation). Let's store a full 32-bit hash in the hash table cells. It would require twice as much memory, but string comparison will rarely be needed (only when the hashes are equal). I will probably try this.
About prime numbers: I just found in Sedgewick's book that they were used in modular hashing (not to be confused with multiplicative hashing). "Before high-level languages appeared" (i.e., long before your and my birth :), they used something like this:
UINT hash(const CHAR* str) {
    return *(UINT*)str % TABLE_SIZE;
}
Here, TABLE_SIZE should not be a power of 2, or else it will mask out the letters in the higher bytes. Today this function is not very useful, because division is much slower than other operations on modern processors. It will also show terrible results in the "Prefix" test.
What is more important is that your program with separate chaining eliminates the worst-case ("Numbers" test). It's really good.
> ...or else it will mask out the letters in higher bytes. Today this function is not very useful, because division is much slower than other operations on modern processors. It also will show terrible results in the "Prefix" test.
Wait wait wait, guys..
No compiler does a slow division to get the modulo of a constant these days. No matter if your constant is a prime or a power of two.
Two multiplications and some adds and subs are all you need for any number. Powers of two just need a logical AND (as everyone here knows).
So using a modulo is not *that* much slower. I think memory access times are much more important.
Besides that: the number of collisions becomes very important if the data you hash is large. For things like filenames you can get away with a simple hash, but try to hash megabyte-large blobs of data, possibly with paging from disk.
You'll be glad about any collision you don't have in these cases.
3 times more? Of what? Certainly not in my code. If you can't imagine it without actually programming, then try to implement it efficiently -- you'll see then.
> Memory consumption should be the same, so it will be fair.
In my view, to be fair
1) Try to fix the table size before you know how many elements you have to hash (you have to do this for almost any real life purpose). With open addr, you will either use insane amounts of memory "just to be sure" or will have to make new tables each time you cross some percentage of table occupancy. That percentage is important, since the rest of the pointers is going to be unused. And making a new table on these points is also expensive (you have to rehash everything again).
2) Forget about nice power of two table sizes -- once you care about the memory you can't allow the luxury of having each table size twice as big as previous.
Everything else is not realistic enough.
> You probably counted them twice (when inserting into the table and when searching in it)
You're right. So you count "collision" only after inserting? Moreover, should "collision" be every cmp or only "did it hit empty table entry or not"? Personally I'd avoid the term "collision" at all -- it would be better to count some specific operations which have some specific costs (cmp, hash calc etc). Or if we observe "table occupancy" it's also not the full information there once the chaining exists. Etc. More or less boring things -- personally I wouldn't mix the speed tests with the evaluations of the properties of the hash functions or the program strategies.
> Separate chaining reads from non-adjacent cache lines.
What makes you think that?
> About prime numbers: (...) they were used in modular hashing (..) "Before high-level languages appeared"
Exactly! That's why people who knew about that even later tried to keep the table sizes uneven. However I still stumble on the texts that preach the major importance of primality (and I never believed them). On another side, most of the times you don't have the luxury of having hash tables with only of power of 2 sizes, so the "modulo" calculation is hard to be avoided in practice -- that's why I liked to see it in the tests. :) As the added benefit, it demonstrated that dependence of hashes on primes is not really needed.
Btw, I've just checked K&R 2ed to see their example, now you know the whole story behind their line. :)
(speaking of defines, in your cpp code you should use enums instead)
Re the modulo operation: as far as I know, it can't be avoided if the table size is not known in advance.
Orig code (no chaining, table size twice as big as in chaining):
It's equivalent to j = i - i / 257 * 257, where i / 257 is calculated with a MUL. Unfortunately, this optimization is impossible if the table size is not constant.
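For a fixed divisor such as 257, the reciprocal multiplication can also be written out by hand (a sketch; the magic constant comes from the Hacker's Delight method mentioned below, and the identity 257 * 0xFF00FF01 = 2^40 + 1 makes the result exact for all 32-bit inputs):

UINT Mod257(UINT i) {
    UINT q = (UINT)(((unsigned __int64)i * 0xFF00FF01) >> 40);  /* q = i / 257 */
    return i - q * 257;                                         /* i % 257 */
}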
There is also a method for optimizing modulo of 2^N-1 (Mersenne primes). Note that his code is wrong for large numbers. The correct, slower code will be:
UINT mod127(UINT k) {
    const UINT p = 127, s = 7;
    do {
        k = (k & p) + (k >> s);
    } while (k > p);
    return k == p ? 0 : k;
}
You are right about the case when there is a lot of data. Paul Hsieh said his function is used by Adobe, Apple, and Google. As far as I understand, they use it for large amounts of data (such as the Google web search index). Complex functions perform better in this case, and his function is one of the best for this kind of task.
I'm sorry, not 3 times, but N additional pointers for the next items, where N is the number of strings inserted into the table.
About non-adjacent cache lines: with separate chaining, you need to look up the hash value in the index table, then walk to the key-value pairs. They are usually located far from each other in memory. If the value is not found in the first record, you need to follow the next pointer, which also can be far from the first one.
Now imagine open addressing with stored 32-bit hash. In case of collision, it will read from adjacent memory cells (usually, no need for string compare), so it may be faster than other methods. I will test this.
A hybrid scheme looks interesting: items in the collision chain will be close to each other, so it should be fast. Another scheme.
> And making a new table on these points is also expensive (you have to rehash everything again).
Resizing hash tables is another interesting topic. For now, let's assume that we know the approximate table size before starting hashing. For example, in a word counting program you could divide the size of the text file by the average word length to get the estimated size.
Thank you for the results; they are really impressive. Why can't you share your code? Zalil.ru is really easy to use (even with no knowledge of Russian language): just choose your file. It will upload it and give you a link on their server. RapidShare is another popular server.
I still claim you didn't try to imagine how it should be implemented.
> A hybrid scheme
Just another bad wikipedia article with unsubstantiated claims, written by the author who doesn't understand the topic, or maybe just edited to stupidity by the following "contributors" -- I'm not interested to check how it happened.
> Another scheme
This at least doesn't fail the "first glance test".
> For example, in a word counting program you could divide the size of the text file by the average word length to get the estimated size.
The perl script is 25 lines, the input files are 5 GB each. Estimate how big hashes should be constructed during the interpretation of the script on a 256 MB machine.
> Why can't you share your code?
My code is not important. I know you were able to understand everything (even without actually programming!) if you didn't stick to your initial idea and code.
You can simply calculate the fast modulo using reciprocals yourself. No need to rely on the compiler optimization.
The code to find the magic constants for reciprocal multiplication can be found here: http://www.hackersdelight.org
The hashtable size does not change that often, so it's not a big deal.
Btw, I may be wrong, but I remember that compilers generate better code for the optimized division when you work with unsigned values for the index and the prime-constant.
Perhaps you could use the following code:
When you first mentioned modulo optimization, I thought it would be possible to generate the code on the fly. In this case, we can use faster sequences for some divisors instead of general, slower code, but code generation can slow down the whole program. Another solution is to select one of 32 pre-generated functions for prime divisors near 2^K (one function for each possible value of K).
In most cases, using prime numbers provides a number of collisions similar to that of 2^K table sizes (see the table above), so I'm not sure these tricks are worth the cost. It will not be much faster than the original program, but it's interesting and I will possibly try it. Thank you very much for your contribution.
> the input files are 5 GB each. Estimate how big hashes should be constructed...
I agree with you, it should have some reasonable limit on hash table size. However, the basic idea is correct. It makes a difference whether you count words in a 20 KB article or in a multi-megabyte book such as "War and Peace". You can use the file size to estimate the size of the table and avoid resizing it too often.
Your function shows mediocre speed in my tests because of the ((i & 1) == 0) condition, which was not optimized to branchless code by the compiler. Weinberger's function has the same problem (on old computers, the cost of branch misprediction was lower, so I guess his function was quite fast in its time). On modern computers, branches on random data should be avoided in inner loops whenever possible.
In other aspects (the number of collisions) your function is good and shows quite nice results. You should try to avoid the branch inside the loop.
Finally, this "cache-hits bla bla bla" that somebody wrote and you stuck to is nonsense, as I've already said. Check what happens on OA collisions: the code can't just play with pointers, it must do string comparisons, so OA is not better in that case.
Just start implementing C.
Thanks for the comments and advice on my function. I'm aware of the speed issues in the published versions; the only reason why it and the other hash functions on my site are described like that is for explanatory purposes. In fact, most of the hash functions on the page can be sped up trivially by unrolling the loops somewhat.
For example, when I use my hash function in real life, I remove the if statement and process 2, 4, or 8 bytes at a time rather than 1 byte per loop. In any case, a good optimizing compiler should see that the branching statement is not related to the data but to the iterator, and from there should be able to come up with a decent execution path. An example of how this optimization can be done:
unsigned int ap_hash(const char* begin, const char* end)
{
    unsigned int hash = 0xAAAAAAAA;
    unsigned int length = static_cast<unsigned int>(end - begin);
    unsigned int rounds = length / 2;
    const char* it = begin;
    for (unsigned int r = 0; r < rounds; ++r)
    {
        hash ^= (hash << 7) ^ (*it++) * (hash >> 3);
        hash ^= ~((hash << 11) + ((*it++) ^ (hash >> 5)));
    }
    if (1 == (length & 0x01))
    {
        hash ^= (hash << 7) ^ (*it) * (hash >> 3);
    }
    return hash;
}
Note: the same can be done to most of the hash functions presented.
That said, a purely "collisions oriented" hash function test suite would be great to have; I'd be willing to collaborate with you and others here to come up with a set of tests (BJs or PHs) to definitively test hash functions.
The reason why I say this is from experience: good hash functions from a collisions point of view are easier to optimize for speed than good hash functions from a speed point of view are to upgrade to perform well at minimizing collisions.
> In any case a good optimizing compiler should see that the branching statement is not related to the data but the iterator and from there should be able to come up with a decent execution path.
Peter uses MS compiler. I've tried with gcc 4, and it also can't make such optimization. Which compiler can?
>>Which compiler can?
I'm not sure if any compilers do at this point; that is why I reorganize things. The point I was trying to make was that one-at-a-time hashes should have their loops unrolled (where possible).
About the test suite. What do you think about finding the average collision chain length in addition to the number of collisions?
There are probably other valuable metrics. I'm not so good at the math aspects of hash functions (programming was always more interesting for me :), so if you have other ideas, please propose them. You seem to have more experience in math than I do. BTW, what are BJs and PHs?
> The point I was trying to make was that one-at-a-time hash's should have their loops unrolled (where possible).
Thank you, I will try to unroll other functions, too.
You did a good job identifying good examples of the weaknesses of OA with linear probing.
Regarding investigation of hash function and algorithm properties independently of speed, did you know you can do integer arithmetic in Perl just like in C?
>>> In any case a good optimizing compiler should
>> Which compiler can?
> I'm not sure if any compilers do at this point,
Thanks. Which one do you use? I guess MS?
Anybody to try Intel?
When I said Intel I was referring to the "Intel Compiler" (which is distributed with some license managing, limiting software even for Linux).
I've tried the latest gcc from the 4.1 branch, and Peter uses VS 2005, but as far as I've seen up to now, VS 2008 doesn't introduce any significant optimizations on the function level; that's why I asked about the Intel Compiler.
Still, it is possible that gcc 4.2 has some new optimizations (I haven't analyzed what's new in 4.2), so it's worth trying it.
But my question on that topic still stands: is there anybody who can try Arash's original code on the Intel Compiler to see if the bit check of i can be moved outside of the loop by the compiler alone?
>>Arash, with this optimization, your function looks much better. I will include it in the next benchmark.
Can't wait to see the new results. :)
>>About the test suite. What do you think about finding the average collision chain length in addition to the number of collisions?
Average collision analysis is a moot point with regard to hash functions, for two reasons: a hash function is inherently a PRNG, so its behavior is best described as a Poisson process (when was the last time someone wanted to know the average value the Mersenne Twister gives?), and secondly, average collisions would require a quantizer value, which, as you can see, is another debatable issue: what value should be used? Should it be prime, or can it be a power of 2 (for efficient ANDing, etc.)?
As Ace suggested, the maximal chain length would be a nice fact to know and a good measure. Knowing the mean chain length and the variance from the mean, then assessing how close that is to the maximum, would be another, but these should be done with and without quantizers.
Regarding BJ and PH, Bob Jenkins and Paul Hsieh: they both advocate analysis of avalanching properties, which I agree is important and reveals a lot more than basic collision analysis. In short, coming back to the PRNG model for hash functions: given some random string of random length and a hash function that outputs n bits, a good hash function as far as avalanching goes (not mentioning the strict avalanche criterion) is one where the probability of the i-th output bit being 1 is close to 0.5. A hash function with some i-th output bit that is predictable, e.g., always or mostly 1, implies that the bit is either not being included properly in the mixing process or that whatever mixing occurs leaves it that way; a simplistic example is a bit that is the output of an OR gate with inverted inputs (C = A or ~A).
>> Which one do you use? I guess MS?
I use Intel (icc) 10.x, and it doesn't do it. But there is a good reason why: as far as the C++ standard is concerned, there are some issues regarding reference aliasing that the optimizer in the compiler detects and, as a result, is not capable of optimizing away. It may change in the 2009 standard ratification if threading is introduced: according to Sutter, for threads to be a part of the language definition, the language should provide some guarantees about the memory model and such, which would inherently have positive side effects for such optimizations.
In some cases I find it better to craft the code specifically rather than relying on some black box of a compiler's optimizer: you may write something and it may look good in contrived test cases, but then when used elsewhere the compiler may see other possible optimizations and as a result do "other" things. I guess that is why even today you still see people implementing data structures such as RB-trees purely as macros.
Chaining can be used where OH can't. So when implementing only one strategy, not implementing chaining is the wrong choice. And once there's chaining, there's no real reason to implement OH. Second, with chaining, most hash functions will have the same effect -- the difference is hard to recognize in most of the cases. But, where the number of collisions is more important than speed (where access to the data is really expensive), not surprisingly, CRC32 proved to be the very safe bet. And for tables with up to 2^16 elements CRC16 can probably be enough.
Now, Peter, your function gives quite consistently good results in speed and number of collisions, at least in your set of tests! It's amazing that 17, an even smaller constant than 33 or 31, went unrecognized until now. Congratulations! Of course, this will not give a significant speedup in practice (except in cases where somebody unwisely used OH), where widely used functions all have similar speeds.
But you also nicely demonstrated that the effort to develop the complicated functions (excluding CRC of course) was practically unneeded.
Here is some kind of "overall speed scores" based on the normalized results of the chaining tests with your code (I gave all test groups the same weight, which is maybe controversial), with the table twice as small as in your OH tests.
(In chaining, only one pointer per string can be enough. The string can simply immediately follow the "next" pointer.)
Compiled with /O1
(lookup3 is the only one really dangerous here -- not to be used unless it can be checked that the compiler optimized what was necessary to optimize to even use the function)
Compiled with /O2
Other common cases not covered by your tests are hashing of integers, and hashing of memory addresses. In both cases there is a limited number of bytes, some patterns occur often and there's still the need to hash fast and good.
Hmm, I've written about it in your blog before :-]
http://smallcode.weblogs.us/2007/01/05/what-your-compiler-can-do-for-you/
In most cases divide and modulo (together) can be done using only one MUL
My function differs from the others not only in its multiplier, but also in subtracting 0x20 from each character, which helps to "pack" more characters into the hash value. Certainly, this will work only for ASCII strings (you should use a different function for random binary data).
Arash said a lot about PRNG-type hash functions (with the avalanche effect), but these functions are actually less effective for short strings than non-random multiplicative functions. If the table is small enough, a multiplicative function can pack all input characters into an integer, so the hash value will be unique.
Thank you for the overall score figures and especially for lookup3 test. This function is too complicated for both compilers and humans :).
Do you mean hashing 32-bit integers and memory addresses? In this case, different hash functions must be used. Modulo prime should work well, and the optimization proposed by Nils will be very useful.
Arash, thank you very much for these ideas. If I have some free time, I will implement them in my tests. You have a good framework for theoretical tests, so you could try them, too.
Prime quantizers are not needed for hashing strings; ANDing will do better.
About avalanching: multiplicative functions are not random at all, but they often show better results than PRNG-like functions.
Take a look at Python sources. They use a simple multiplicative hash function with M = 1000003 (search for "string_hash" in file stringobject.c). In dictobject.c, they say:
Most hash schemes depend on having a good hash function, in the sense of simulating randomness. Python doesn't, [...] but this isn't necessarily bad.
Read also their notes on optimizing dictionaries.
Thank you again for your help. Your function shows much better results after unrolling, and mine also wins from it :).
> i've wrote about it in your blog before
Peter, sorry, I forgot about this. I knew about "magic numbers" for division, but never thought they could be applied to modulo. BTW, you will need additional shifts anyway, so it's not only one MUL.
Many years ago, at the time I considered such a modification, I believed it shouldn't bring anything. However I never used OH or searched for pathological examples. Have you tried to get the results without that subtraction?
> Thank you for the overall score figures
That was on Core 2. I've made the same scoring on Pentium III, /O2 and received:
Note that here a function with 1.16 is on average 16% slower than the best one. Of course, a bit of reasonable doubt remains, as you selected the test sets. :)
In practice the hash functions are called between other processing, so the speed of the actual hash function can contribute even less to the overall speed. In such cases some shorter function can be better even if, in tests like yours, it appears slower than one with a bigger footprint. On the other hand, when we really want to minimize the number of collisions, it would be interesting to know the "collision goodness factor" from your tests; using the same method I get the following scores:
Now, I also considered the possibility that your function gets too good score here since you selected the pathological example and possibly optimized your function only for that. So I removed the "numbers" results and the scores are:
Note that most of the functions are very close. Not counting 'Weinberger', all inside of 7% span!
(Still, I'd rather count the number of comparisons and not "collisions".)
> Modulo prime should work well, and the optimization proposed by Nils will be very useful.
How much is the gain of using these compared to a single DIV in processor cycles? Can that method be implemented to work for any N? Can it be programmed to work without programmer manually testing constants (class FasterDiv) where the constants are calculated once (but in the runtime) and then often used?
It was almost always slower:
> How much is the gain of using these compared to a single DIV in processor cycles?
All times are in clock cycles; measured with Agner Fog's program.
> Can that method be implemented to work for any N?
Some divisors require 4 additional instructions (compare 257 and 127). If you have a pre-calculated table of divisors close to the powers of 2, you can select only the divisors that give a shorter code. If it gives a longer code, just select the next prime number, e.g., 131 instead of 127.
Another solution is to use CMOV to ignore the result of the 4 additional operations when they are not needed, but this will be a little slower.
> It was almost always slower
I've asked because the method of using multiplication by a constant is widely accepted, but using subtraction is not.
The "collision score tables", which include minus and no-minus versions for various multiplication factors and use chaining, look inconclusive:
Without the Numbers test:
With the Numbers test:
It looks to me that the visible "noise" of the results cannot lead to any useful conclusion. If that operation really conditioned the final result better, even for these test cases, I'd expect every function based on multiplication to benefit from it. Here, K&R got better with subtraction, while x65599 and Bernstein got worse.
Also note that the "no subtraction" x17 comes out much worse, too much to claim that 17 is a much better factor than the others.
Note also that 'a' - ' ' == 'A'. Since most of your test samples contain lowercase strings, what you were doing most of the time was just turning them into ALL CAPS strings and testing the *same* functions!
The only useful real conclusion I have up to now is that functions with multiplication factors are still the good choice for practical purposes.
Yes, it's the combination of the factor 17 and subtracting that makes it the fastest function.
> functions with multiplication factors are still the good choice for practical purposes.
That was the main point of my first article :).
Ace, what do you think about Python dictionary code?
> it's the combination of the factor 17 and subtracting that makes it the fastest function.
But let me point out again: I believe that two functions on strings, where one processes (char - Const) and the other processes (char), can be considered one and the same function. It doesn't look to me that you proved anything, except that the function gives good results exactly for the set of strings you've selected. Wouldn't just "toggling case" of your set before hashing result in a loss exactly where the gain is now? Do you want to say that you've developed a hash function that does well on strings with mainly 'a'..'z' but not on strings with mainly 'A'..'Z'?
> what do you think about Python dictionary code?
I haven't invested much time, so I can't say much. If I see correctly, they have a special treatment for very small dictionaries; that's very important and a very good thing to do. Then, if I've understood correctly, they don't use chaining, and they resize the table once it's more than 2/3 full. Now, whether 2/3 is an optimal limit is something that depends on their collision resolution efficiency. In my opinion you've demonstrated that without chaining and with linear probing even 1/2 can be "too full".
No, it will not be slower. Here are the raw results with upper-cased text files (open addressing, Pentium M):
Mixed case and lower case:
After conversion to upper case:
x17 should be slower on strings with a lot of "\r\n" and other control characters (though I have not checked this).
http://stochasticgeometry.wordpress.com/2008/03/29/cache-concious-hash-tables/
Actually, the number of collisions that Murmur (and lookup3, and any cryptographic hash) produces is in the range predicted by statistics if the hash is truly random. I don't recall the equation offhand, but your x17 does better than predicted because it's less random. :)
Statistically good distribution is important in some applications because it allows me to say "My hash function will not produce pathological results with your keyset" with some certainty, even if I have no idea what your keyset is. Bob Jenkins's frog.c test is particularly good at producing pathologically bad keysets -- it creates large keys that are mostly 0 bits but with a small handful of 1s -- which choke most simple-but-fast hash functions. Murmur happens to hit a nice sweet spot where it is simple, fast, and still statistically strong.
UINT32 HashAdler(const CHAR * data, SIZE_T len)
{
    UINT32 a = 1, b = 0;
    while(len > 0) {
        // Process in blocks small enough that a and b cannot overflow 32 bits
        SIZE_T tlen = len > 5550 ? 5550 : len;
        len -= tlen;
        do {
            a += *data++;  // plain sum of the bytes
            b += a;        // sum of the running sums (position-weighted)
        } while (--tlen);
        a %= 65521;  // largest prime below 2^16
        b %= 65521;
    }
    return (b << 16) | a;
}
But for me, the shortest Murmur variant is a bit of cheating. The starts of the strings in C and C++ are *not* aligned to 4 bytes unless a typical malloc or new is done for each string, or each string is copied to a convenient buffer before the hash is calculated. So I'd consider MurmurHashAligned2 the real function, and I'd do the timings on a set of strings which are packed in memory one after another, without anything between them (which makes the start of each of them unaligned and unpredictable). I'd also like to see the comparison between "copy to the nice buffer, then do MurmurHash2" and "just do MurmurHashAligned2" for strings never longer than e.g. 1024 bytes -- when something is longer, it's not probable to appear in unaligned input (this test would probably show the quality of the L1 cache?).
------
The need for "statistically good distribution" is exactly the reason why "adding constant" to each source byte should not matter, the reason why I didn't like your claim about the goodness of "subtracting constant" in x17.
------
Austin Appleby mentions on
http://murmurhash.googlepages.com/discussion
that he did chi-square and avalanche testing.
Chi-square tests are fundamental and should be used for most experiments. In computer science, they have of course been used to check the goodness of random number generators, probably for as long as those have existed.
One possible introduction:
http://www.fourmilab.ch/rpkp/experiments/statistics.html
and the program which uses the test to evaluate the quality of "random" stream:
http://www.fourmilab.ch/random/
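For a concrete picture, here is a minimal sketch of how a chi-square test applies to hash buckets (my own illustration, not the fourmilab program): with n keys and m slots, each slot should receive about n/m keys, and large deviations indicate a poor function.

/* counts[j] = how many keys fell into slot j (m slots, n keys total).
   The result is compared against the chi-square distribution with
   m - 1 degrees of freedom. */
double ChiSquare(const unsigned *counts, unsigned m, unsigned n) {
    double expected = (double)n / m;
    double chi2 = 0.0;
    for (unsigned j = 0; j < m; ++j) {
        double diff = counts[j] - expected;
        chi2 += diff * diff / expected;
    }
    return chi2;
}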
http://code.google.com/p/google-perftools/
In addition to tcmalloc, the perftools include some interesting hash table implementations. If you can deal with the limitations, intrusive data structures can be much more efficient.
http://code.google.com/p/google-sparsehash/
Nice go for a 4th time around, but I think we previously discussed that, as far as timing is concerned, if the hash functions are not implemented in a similar fashion, then it's really not a fair or meaningful experiment. The hash functions from my site are implemented in their most basic definition; for production purposes you wouldn't use them as is -- you would try to do things like Duff's device, or something similar to what Jenkins does.
Take, for example, the way you have unrolled my hash function: it's probably the worst way it could have been done. Furthermore, why not djb, pjw, or the others? So as far as timing is concerned, unless you can get them all on the same footing, it is worthless/meaningless to mention times.
The next issue is collisions. I still don't accept your methods as being generally acceptable, though they do seem valid for most situations; as I suggested previously, it may be better to get together on this and define a real set of tests, standard inputs, and testing methodologies.
In any case, keep up the good work.
x86 processors can access non-aligned dwords. My previous measurements showed that it's faster to read short strings without alignment, so that you can avoid an additional switch statement before the main loop. Alignment matters only if you are going to use the function on a different processor or to hash some very long strings.
Subtracting a constant matters because you then calculate the hash modulo the table size, so you effectively throw away some high bits. You can save more information in the low bits by subtracting ' ' (if you know that '\n', '\r', and other control characters will not appear in the hashed strings).
Certainly, x17 has nothing in common with a statistically good hash function; it just tries to "pack" more characters into a small hash value.
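For reference, a minimal sketch of the x17 idea being discussed (multiplier 17, with ' ' subtracted from each byte; the exact benchmark code may differ):

UINT HashX17(const CHAR *key, SIZE_T len) {
    UINT h = 0;
    for (SIZE_T i = 0; i < len; ++i)
        h = 17 * h + (key[i] - ' ');  /* 'a'..'z' become small values */
    return h;
}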
Won, thank you, I'm studying SparseHash now. They used quadratic probing instead of separate chaining. It would be interesting to compare these approaches (I will probably do this in future).
The first problem with Adler and Fletcher is that sum2 will be masked away when calculating the modulus. I tried replacing the last line with the following:
return a ^ b;
and got much better results:
The second problem is that the characters are not "weighted" (multiplied by different numbers) in the sum a, so once the high half is masked away, Adler-32("01") = Adler-32("10"); that's why it fails the Numbers test. Ditto for anagrams in Shakespeare's sonnets: Adler-32("heart") = Adler-32("earth").
So, Adler-32 may be a good checksum for compressed data, but I would not use it for hash tables. Murmur is definitely better :)
No, you just shuffle the values of each input byte (1 becomes 225 etc).
> Certainly, x17 has nothing common with a statistically good hash function; it just tries to "pack" more characters into a small hash value.
Yes, your tests clearly demonstrate that even using simple K&R and separate chaining can often be good enough. Maybe Google searches for a stronger function only because they decided not to use separate chaining?
I haven't analyzed their code, but it could be that they didn't really have to avoid separate chaining -- that they would have been able to keep the same memory usage, or even get better behaviour in cases when the number of input elements is not known in advance.
I'd still like to see any real-life example which demonstrates a real need for a statistically strong hash function, provided open addressing is avoided.
http://www.augustana.ca/~mohrj/courses/1999.fall/csc210/lecture_notes/hashing.html
http://www.boost.org/doc/libs/1_35_0/doc/html/intrusive/unordered_set_unordered_multiset.html
Boost, of course, has lots of interesting stuff.
I mentioned Cuckoo hashing to Peter in an e-mail as an alternative to chaining. There are variants that can sustain very high load factors.
http://en.wikipedia.org/wiki/Cuckoo_hashing
For Adler/Fletcher: ah, that makes sense. I suppose the prime-mod versions would work much better. But I don't think it is correct to say that Adler/Fletcher cannot distinguish between permutations (think about how 'b' accumulates). I also don't think a^b is a particularly good mix for Adler/Fletcher. There is the problem of how you compute the bucket from the hash. If you need N bits, maybe it makes sense to combine N/2 bits from a and N/2 bits from b. Maybe this is as simple as taking the middle N bits of an Adler/Fletcher hash, rather than the lowest N bits.
One variation is to allow for multiple items in a table entry (similar to a multi-way cache) before rehashing those elements. This variation is certainly not unique to cuckoo hashing, but seems to work well with it.
Anyway, my discussion is mostly from memory, but these postings are encouraging me to look into it myself...
For short strings, the high byte of 'a' is zero, so if we take the middle N bits, the low bits of 'a' will be lost. It makes sense to calculate a ^ (b << (N/2)).
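As a sketch, only the final combination of HashAdler above changes (the shift is picked for a small table, N/2 = 4 here):

UINT32 CombineAdler(UINT32 a, UINT32 b) {
    /* keep the low bits of both sums below the table-size modulus */
    return a ^ (b << 4);
}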
I've just tried a ^ (b << 4) for these test files, and the results were comparable with Murmur:
Thank you very much for your ideas.
AFAIK, the tables there also clearly demonstrate the superiority of chaining? See what happens when the load factor increases.
So I'd still like to see any good argument for using open addressing, especially since it demands a more complex hash function just to perform acceptably, then requires calling this function even more often, and behaves exponentially worse as the load factor approaches 1 (I'd call that a lose-lose-exp(lose) scenario :) ).
However, there are some advantages to open addressing. A non-intrusive C++ hash table implementation that uses chaining (e.g. tr1/unordered_map, based on SGI STL hash_map) has to allocate nodes to store the data + the link pointers. This means memory allocations, copies, and indirections that contribute to the constant factor for chained hash table implementations. For typical load factors (1/2 to 2/3), open addressing (google-sparsehash uses quadratic probing) can be faster.
http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html
There is also the advantage that open addressing can be slightly more compact, but I don't know how important this actually is. Sustaining a high load factor is probably much more important than avoiding pointer overhead, so chaining might be even better in this regard. If you really care about compactness, something like Judy might be better:
http://nothings.org/computer/judy/
An intrusive hash table implementation avoids most of the practical problems of chaining, since you avoid all those copies, allocations, and indirections. These are much faster than any of the non-intrusive options, but they are not appropriate in every situation. It is not always possible to instrument classes to be used in such an application. The instrumentation overhead affects all instances, not just those in hash tables. Ownership semantics are very different, and that can lead to design subtleties (amplified by concurrency).
Moreover, how can the load factor, which so influences open addressing, be kept constant without introducing even bigger performance penalties, when even slight changes in load factor force a rebuild of the whole container?
I agree with you that ownership is very important. Personally, I don't think it's smart to overuse smart pointers in C++, and when somebody doesn't want to care about ownership, he should not use C++ at all but some language which has "natural" GC. For C and C++, I think the best results are achieved by maintaining a distinction between owning and referring containers. The compiler writers have known about that since forever, I guess. And they certainly had to maintain a lot of hashes.
Is it possible for you to explain (if you know) the exact design decisions behind "sparse hash"? I fail to grasp them from the site. Only then can we discuss whether there's a possibility for improvement, or what the major contribution of that implementation is.
Thanks in advance.
Open addressing does not require additional information to handle collisions, because the "chaining" is implicit, based on the probing strategy. Objects are copied into the container without need for modification or ancillary data (maybe a little bit to indicate empty/full).
Explicit chaining requires some kind of pointer. Each bucket forms a short linked-list of elements. You can implement that linked-list in several ways. The non-intrusive way (like SGI hash_map) makes a node that is essentially a pair of the object + a pointer to the next node. The intrusive way (like Boost intrusive_unordered_map) is to have a special field within the object itself to point to the next object. So for STL-compatible containers, this isn't necessarily a huge win, since they are value-oriented. However, a non-owning, pointer-oriented intrusive hash map can be very fast because it avoids almost all copies and allocations.
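A minimal sketch of the two layouts just described (the type names are mine):

/* Non-intrusive (SGI hash_map style): the container allocates a node
   pairing a copy of the element with the link pointer. */
struct Element { int key; /* ... payload ... */ };
struct Node {
    struct Element elem;   /* copied into a freshly allocated node */
    struct Node    *next;  /* bucket chain */
};

/* Intrusive (Boost intrusive style): the link field lives inside the
   element itself, so insertion needs no allocation and no copy, but
   every instance pays for the extra pointer. */
struct IntrusiveElement {
    int key;
    struct IntrusiveElement *next_in_bucket;
};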
BTW, ownership is always a design problem. GC does not automatically solve it for you.
Specifically, it seems that his problem was that when STL hash_set is used to *own* values (of some big structures?), a lot of memory was taken by placeholders for unpopulated entries.
That's why he designed the "sparsetable", which spends (more or less) a single bit to mark a used or unused entry, and allocates space to hold only the assigned values it owns. Then he used that "sparsetable" as the underlying table for the open-addressing hash table implementation.
So his goal was to fit as many "indexed" (with a hash function) structures in memory as possible without spending even a single pointer per structure kept, giving away performance to get the lower memory footprint.
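My guess at the mechanics, as a minimal sketch (not google-sparsehash's actual code): each group of 32 logical slots keeps a 32-bit occupancy bitmap plus a packed array holding only the occupied values, so an empty slot costs one bit instead of a whole entry.

#include <stdint.h>

struct SparseGroup {
    uint32_t bitmap;  /* bit j set => logical slot j is occupied */
    int      *values; /* packed array of popcount(bitmap) values */
};

/* Position of logical slot j inside the packed array: the number of
   occupied slots before it (GCC builtin used for brevity). */
unsigned PackedIndex(const struct SparseGroup *g, unsigned j) {
    return __builtin_popcount(g->bitmap & ((1u << j) - 1));
}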
Nice.
I've accidentally discovered the error in the implementation of "HashWeinberger." The line
is wrong, as >> has lower priority than ^, and that causes significantly worse results. The original code in the book was:
But even their version doesn't look too promising as the last line in their function is:
where the prime is defined as 211. But at least that suggests that it was not meant to be used in "open addressing."
Another problem with hash.cpp: either all or none of the functions should use the "fix", as in:
otherwise the functions are not compared under the same conditions.
Thanks for your comment. MSVC generates the same machine code for
and
Operator ^ has lower precedence than >> (^ is lower in the precedence table). You probably changed something else in the source code that affected performance.
I tried
h ^ (h >> 16)
for the K&R function, and it was slower this way:
For Larson's hash, I just forgot to try it (the XOR version is faster). Thank you for reminding me!
You are certainly right -- my error! This was thoughtless on my side, as I was used to seeing x >> y + z where the author wanted >> done before +. However, ^ really is weaker than >>.
To excuse myself, this lapse occurred as I wanted to make my small contribution:
D. R. Hanson has used, in two of his books, from around 1997 up to now, e.g. here:
http://code.google.com/p/cii/source/browse/tags/v20/src/atom.c
the following hash function:
And it's very poor! Your tests immediately demonstrate the (this time, real) error.
No problem.
The comments are plain text now. I've rewritten strchr.com from scratch, as I promised to do earlier. Real minimalism: around 1500 lines of PHP code written in a couple of weeks, no bloated CMSes or "web frameworks". All your comments were saved, as usual. Later, there will be a WYSIWYG editor for comments and a button to publish your own article :)
Xkcd recently brought to my attention the "Collatz conjecture" (http://en.wikipedia.org/wiki/Collatz_conjecture).
Using programming-language notation, it gives the following Python function (where numbers have an unlimited number of bits):
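Presumably along the lines of (my reconstruction of the omitted function, in the Python the text refers to):

def collatz(n):
    while n != 1:
        if n & 1:
            n = 3 * n + 1   # the "hash step"
        else:
            n = n >> 1      # drop a least-significant zero bit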
the conjecture states that for every n the loop will eventually terminate.
One way to look at it: 3*n+1 would be an acceptable hash function step where each input is a single 1 (of course, using an unlimited number of bits is not for practical hash functions). The n >> 1 branch removes all zero LS bits, if they exist; otherwise the "hash function" is applied (note also that, by the construction of the loop, every "hash" step results in at least one zero LSb). So the conjecture can be restated: alternately applying the 3*n+1 transformation to a number with an unlimited number of bits and the transformation which removes the LS zero bits, at some moment there will be exactly 1 set bit left in the number.
Of course, I don't see any practical aspect in all this, except that a lot of experiments (in the Wikipedia article) show the "bit randomizing" properties of multiplication by an odd number.
And that's what's wrong in Hanson's hash: he uses an even multiplier, so in every step he loses information about the previous steps(!) Using the "scatter" mapping trick on the input byte doesn't help at all -- the multiplier is the engine, and here it's the broken one.
I suggest adding the function used by Hanson to your set of hashes in hash.cpp. The results nicely demonstrate how a faulty hash function can be designed, demonstrated in books, and used for a long time:
The serious weakness is visible for your "Postfix" set (the strings with the same long postfix).
Finally, just for completeness, I propose one more function. I named it Novak Hash, and it's basically what should have been done to keep the "scatter" approach and still implement the function right (that is, without losing significant information about the previous values in every step, while reducing the size of the table):
I just consider it "how Hanson's function should have been done"; I still believe that functions that don't use any tables are better in almost all real-life circumstances.
Bret Mulvey
Evaluating hash functions
http://home.comcast.net/~bretm/hash/
The downside of just testing general use cases is that programmers who don't know any better will see a hash function that works for a particular situation, spread it around, and it will end up being used for things it shouldn't be! E.g., the huge number of hash functions that exist now, many of them in wide use despite being pretty bad outside a certain range of keys. Not to say that I know of a good universal metric, but work towards one would be good.
Thank you very much! CRC took 4th place with your optimization. Its results are very balanced (a low number of collisions; no spikes in any test).
AFAIK, the best we can do is to test the functions on different sets of strings (both real-world and synthetic). I tried to find the sets that cover most common situations and are very different statistically: English words, function and variable names, sequential numbers, etc. If you have other ideas, I will be happy to add more tests.
For more "real world" a list of utf-8 words with multi-byte encodings would be good. I've also found a binary list of ips is good at screwing up hash functions, but that will need special consideration for loading as they can contain nulls and carriage returns.
I have reconsidered Novak Hash and I'd like to update it: instead of initializing with h = 0, I'd like h to be initialized with h = 1. The reason is that one of the values in the S-box is 0, so when strings of different lengths consisting only of the byte which maps to zero are hashed, a pathological case occurs.
A similar pathological case can be constructed for every hash that has a multiplicative constant, if the initial value of h is 0. E.g., for x17, since it subtracts 32 from each byte, it degrades to the pathological case when a lot of strings consisting only of spaces, of different lengths, are hashed: with h = 0, each step computes 17 * 0 + (' ' - ' ') = 0, so all of them will map to the same slot!
Yes, R, RR, RRR, etc will not hash to the same value with h = 1, but existing collisions will still collide. Set h to whatever you want and hash "novak" and "qovAk", they'll both have the same hash.
You can continue plugging holes, but you'll only slow it down. If you really want to go the S-box route, wouldn't a full 32-bit S-box like http://home.comcast.net/~bretm/hash/10.html be better than single bytes?
What can be considered a "good enough minimum" is indeed the most interesting question! Back when assemblers and compilers had the limitation of only 6 letters per symbol, even just summing the letters was considered "good enough." Now it appears that the demand for a full cryptographic "avalanche" effect on all bits is not always necessary -- I'd really like to know at which point it becomes important. Avoiding some "obvious" problems with the simplest functions should help.
That's the idea behind my experiments with an 8-bit S-box: to find out if something can be "good enough" with a smaller cache footprint (and something that's maybe more convenient for embedded systems' limitations? I also prefer "endianness-independent" implementations). Still, for my taste, avoiding tables altogether can be better whenever the function is not invoked very frequently -- tests with a "full" cache don't reflect a lot of the typical usage scenarios I can imagine.
I'd really like to know where the limits of every function are.
I believe fixing the simple multiplicative functions for some common scenarios with a different initial value can be the "good enough" solution for some usages.
BTW, I also know that AES primitives are increasingly being implemented in hardware. I admit I haven't checked whether S-box mapping is actually part of the primitives implemented by Intel.
I know CRC32 is implemented in some new Intel CPUs, and I'm very interested to see how your implementation compares to one using the specialized instructions.
Andrew, I've added the tests with UTF-8 and IP addresses (in binary). Fletcher's hash is terrible for UTF-8 texts because of repeated first bytes in multi-byte sequences. I did an additional test with a Russian novel, and Fletcher was even worse. The results of other functions were not very different from the previous tests.
Ace, I've updated Novak hash and x17 (h = 1). AES instructions implement the whole round of encryption, not just substitution using S-boxes.
CRC32 is implemented in Core i5/i7 processors, but with the iSCSI polynomial, so it is useless for ZIP, PNG, MPEG-2, and many other formats, which use a different polynomial. Though it does not matter much for a hash function which polynomial is used. I will probably try accelerated CRC32 later :)
Could any of the AES instructions be used for hashing too? I can't seem to find any speed comparisons for the SSE CRC32/AES instructions at all.
As Peter linked, AES instructions seem to be too heavy for hashing, as they do much more work -- which is not surprising; the substitution alone is very effective as soon as the table is in cache.
I've just done the tests on Core i5 processor (see the results above). Hardware-accelerated CRC is the fastest hash function; x17 is slow on this processor for some reason.
I've also optimized the handling of remaining bytes in your CRC32 code (using two conditions instead of the loop). It's slightly faster this way.
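For reference, a sketch of a hardware CRC32 hash with the tail handled by two conditions instead of a byte loop, in the spirit of that optimization (my reconstruction using the SSE 4.2 intrinsics, not the benchmark's exact code):

#include <nmmintrin.h>  /* SSE 4.2: _mm_crc32_* */

UINT HashCrc32HW(const CHAR *key, SIZE_T len) {
    UINT crc = 0xFFFFFFFF;
    for (; len >= 4; len -= 4, key += 4)
        crc = _mm_crc32_u32(crc, *(const UINT *)key);
    if (len & 2) {  /* two remaining bytes */
        crc = _mm_crc32_u16(crc, *(const unsigned short *)key);
        key += 2;
    }
    if (len & 1)    /* final odd byte */
        crc = _mm_crc32_u8(crc, *(const unsigned char *)key);
    return crc;
}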
The hardware CRC32 numbers are fairly impressive. That is why I wondered about AES: if any of the ops are fast enough, then they could be quite good.
Here's what I've found about the speedup of CPU-accelerated AES compared to non-accelerated AES: it is approximately "just" a factor of 4, at least according to:
http://wiki.debianforum.de/BenchmarkFestplattenverschlüsselung
and, according to Intel, up to 10 for multi-core scenarios:
http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/
As an AES-NI instruction does a whole round on 16 bytes, and there is a lot of processing in one round, and the typical speedup is only 4, I believe AES-NI is still irrelevant for the simple kind of hashing that's the subject here.
Peter, do you then have an i5-750 instead of one of the i5-6xx's (which are all supposed to have AES-NI, at least according to: http://processorfinder.intel.com/List.aspx?ProcFam=3155)?
Until now, I didn't know there were i5's without AES-NI at all.
The AES-NI results are impressive. As I found, it's supported by PGP, DiskCryptor, and other popular encryption software. GnuPG does not use the hardware acceleration, most likely because they want to stay portable.
Yes, I have Core i5 750 (four cores, SSE 4.2, but no AES-NI).
P.S. Here is a new article with detailed CRC32 benchmarks.
http://amsoftware.narod.ru/algo.html
http://herakles.zcu.cz/~skala/publications.htm
especially:
http://herakles.zcu.cz/~skala/PUBL/PUBL_2010/2010_WSEAS-Corfu_Hash-final.pdf
http://herakles.zcu.cz/~skala/PUBL/PUBL_2010/2010_Corfu-NAUN-Hash.pdf
Perhaps they will be helpful.
About visualization: I wonder how to do it, too :) If we plot the low 16 bits on the x axis and the high bits on the y axis, there will not be enough pixels to display them on a web page (we would need 2^16 × 2^16 pixels).
I know it's not needed when some specialized solution is considered, but it's interesting to play with experiments that are supposed to give us more insight into the behavior of the functions.
I hope it would be interesting to you to see FNV-1A (13-bit) used in my intoxicatingly fast word-list ripper Leprechaun at http://encode.ru/threads/612-Fastest-decompressor!?p=22184&viewfull=1#post22184, where a word list of 22,202,980 Latin-letter words is processed.
See also 'FNV1A_Hash_Granularity' at http://encode.ru/threads/612-Fastest-decompressor!/page3.
I salute the host Peter for this site: it is most comprehensive and useful for sure; keep going.
After some minor changes in your Hash17_unrolled, here comes 'Alfalfa', tuned for Latin-letter words, i.e. alpha strings between 1 and 31 chars.
This variant behaves surprisingly well when 13 bits (8192 slots) are used.
With my testbed (22,202,980 distinct words), which uses a 3-level hash (1st: 26 slots for the first letter; 2nd: 31 slots for string length; 3rd: 8192 slots given to some multiplicative hasher), Alfalfa outperforms Hash17_unrolled by 18,677,243 - 18,645,799 = 31,444 fewer collisions.
In my opinion, every hash variant must be tuned specifically for a given data set and should be regarded as a potential gem (not as a common stone only because it performed poorly on another set).
A suggestion: adding one more column to your result table, containing all Latin-letter words encountered in wikipedia-en-html.tar.wrd (12,561,874 distinct words in 146,973,879 bytes), would give a better view of collision performance under heavy (real) loads alongside the lightweight Shakespeare's sonnets, don't you think?
If you are interested, my testbed is here: http://www.sanmayce.com/Downloads/Leprechaun_r13+++++_EXE+ELF_vs_Wikipedia.zip
Regarding speed, I offer two extra-fast hashers (tuned for 8192 slots):
Regarding collisions, it's up to your testbed; nevertheless, I expect FNV1A_Hash_4_OCTETS to be a gem.
Peter, if you find it useful please put it along with generic FNV-1A.
Poorly performing functions are Fletcher and Novak. Not-so-good functions: Weinberger, x17, and Paul Hsieh. For this test, I used Georgi's list of all words in English Wikipedia (12.5 million words). You can try other word lists (the results will be similar) by running the benchmark with
/c
command-line switch and processing the output with the get_google_chart.py script.
For a small number of bits, all functions are equally poor because of the birthday paradox. I didn't implement 32-bit hashing (it would require a rewrite of the benchmarking code), but the results should be similar to the above.
Georgi, I followed your suggestion and benchmarked a large word list (wikipedia-en-html.tar.wrd). Thanks for sharing your testbed.
Hardware-accelerated CRC and Murmur are the winners again. Novak and Fletcher are at the bottom of the list because of their high number of collisions.
Your unrolled version of FNV1A showed very good performance on small data sets. However, it has a high number of collisions in the Numbers and IPv4 tests, so I cannot recommend it for general usage. The multiplications in the original FNV "mix up" the input characters to achieve an avalanche effect. Unfortunately, when we remove some of them, this effect is lost.
About tuning for a given data set: there is perfect hashing (no collisions). It's used in parsers and other applications where the list of words is fixed. For example, imagine a hash table for C++ keywords (if, else, for, class, etc.).
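As a toy illustration (my own, not from any real parser): for the fixed set {if, for, else, class}, the string length alone happens to be a perfect hash, since the lengths 2, 3, 4, and 5 are all distinct.

#include <string.h>

static const char *keywords[] = { "if", "for", "else", "class" };

/* Perfect hash for this fixed set: index = length - 2, no collisions.
   Returns the keyword index, or -1 if s is not a keyword. */
int LookupKeyword(const char *s, size_t len) {
    if (len < 2 || len > 5)
        return -1;
    int i = (int)(len - 2);
    return strcmp(s, keywords[i]) == 0 ? i : -1;
}

Real tools (e.g., gperf) search for such a collision-free mapping automatically over the fixed word list.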
In the end, the winners didn't change: Murmur2 and hardware-accelerated CRC32 are the fastest hash functions. SBox looks promising, too.
Speed in percentage of regression (reference: the fastest non-hardware based):
==============================================================================
iSCSI CRC -11.27
Murmur2 0
x65599 1.75
Paul Larson 2.75
Alfalfa 3.35
FNV-1a 5.38
Hanson 8.55
SBox 11.27
CRC-32 11.52
K&R 11.82
x17 unrolled 14.42
Bernstein 14.43
Alfalfa_HALF 15.95
Sedgewick 17.17
lookup3 19.09
Paul Hsieh 19.44
Ramakrishna 26.11
MaPrime2c 30.04
One At Time 31.11
Arash Partow 31.78
Weinberger 44.85
FNV1A_unrolled 45.24
Novak unrolled 231.91
Fletcher 247.25
Collisions in percentage of regression (reference: the best):
=============================================================
Bernstein 0
CRC-32 0.04
Alfalfa_HALF 0.15
iSCSI CRC 0.17
Paul Larson 0.28
Sedgewick 0.31
FNV-1a 0.34
Murmur2 0.35
K&R 0.43
SBox 0.47
MaPrime2c 0.49
Arash Partow 0.5
lookup3 0.51
One At Time 0.66
Ramakrishna 0.92
x65599 1.38
Hanson 2.68
Paul Hsieh 5.11
Alfalfa 5.88
x17 unrolled 16.22
Weinberger 70.72
FNV1A_unrolled 193.48
Novak unrolled 204.62
Fletcher 336.97
Looking at it this way, Murmur2 is less than 3% faster than Larson, which uses a simple constant: 101. Then, looking at the number of collisions, there are a lot of functions that differ by less than 1%, and Larson again scores very well. So my current conclusion would be: if I had to remember something for hashing up to a few million English word-like entries, I'd just remember Larson's constant 101 (decimal) and stop worrying!
The graph is nice. The biggest visual insight is how the so-called "FNV1A_unrolled" becomes very bad at some fixed point, and how much worse than most some functions are.
What the graph shows is just the dependency of the number of collisions on the table size (table size == 2 ** bits) for different functions and a fixed large input set, and it's not unexpected that you can't see much for tables significantly smaller than the input set size. That's why I suggested something else: designing a function that always produces the input set for a given number of input elements.
I'd suggest that the set would have:
- words (40%)
- fixed string + words (20%)
- words + fixed string (20%)
- pure binary numbers, increasing (20%)
Not because it's a good representation of any real input set, but because we've already seen that it nicely amplifies the weaknesses. Then the goal is to select enough words from the 12 million for any given target size. I'd sort the Wikipedia words with a sort function that uses the number of letters as the highest criterion, and then I'd simply take the first N (where N is target table size * 0.4 for the first set, etc.). Then my initial idea was to fix the number of input set elements and draw a separate graph like yours for each such input set. That would allow visual comparison of the functions even for smaller input sets, where we don't "see" much now.
Very informative and well done (especially the graphical chart).
Allow me to give some of my thoughts regarding collisions:
Looking at the chart, I see no difference between all the functions up to 22 bits, which is something new for me and at the same time very good (I expected 16 bits to be the upper limit for good behavior), because in my view 13-20 bits is the most targeted range.
And for future tests (a 64-bit compiler is needed because long long is used), it would be interesting to show the performance of 'FNV1A_Hash_Granularity' when only 'FNV1A_Hash_Granularity(wrd, wrdlen>>3, 3);' invocations are used for all lengths. I have made tests with 'FNV1A_Hash_8_OCTETS' with 32-bit (not 64-bit) multiplications, and the speed was crazy, but with high collisions (due to lost carries).
Below is the new 'Alfalfa_DWORD': your function 'x17 unrolled' boosted even more, with the best collision result on my biggest (22,202,980 Latin-letter words) test, and very fast too.
A 64-bit benchmark would be interesting, and I will do it in the future; thank you for the idea. Most functions are tuned for 32 bits, but some should be fast in 64-bit mode.
There is no difference between hash functions for a small number of bits because the benchmark attempts to squeeze 12.5 million words into 2^21 ≈ 2 million cells. It's my fault :( Not only the hash table size but also the input set size should be reduced, as Ace proposed.
Ace, it would be easier if you did the visualization as you wanted. You can freely use my code and the Google Charts API to draw the graph. I'm very interested in the results, and you can publish them here.
I agree that Larson's hash scores well on most tests, and it's very simple. Thanks for the scaled results on the large data set, they are much more readable.
svg (I use Opera, haven't tried others): http://pastebin.com/raw.php?i=btMpwEkb
png: http://i54.tinypic.com/2ryse1w.png
What that says to me is that your original tests contain too much "noise" compared to the "signal" (at least for my taste!). But they still helped detect some serious problems in some of the functions, and altogether some very insightful steps came out as the topic developed. Peter, thank you for maintaining this topic; its educational value is really high!
I guess if it were written in assembly (to avoid these ugly BYTE extracts), the picture would be different. Also, due to short strings (the average word length in Wikipedia-en is 10 bytes: 2 loops within the first cycle, which could be unrolled too, but I don't like such 'tuning'), the speed performance of 'Alfalfa_DWORD' and 'FNV1A_Hash_4_OCTETS' suffers.
When multi-millions of keys are involved, I firmly believe that a cascade of hash functions must be used to ensure a good (no worst case allowed) distribution/dispersion; then 22 bits are quite enough. In my case, instead of using some 24 bits, I chose 5 bits / 5 bits / 13 bits. Here the principle 'know your data' comes as a rule. As for the birthday paradox, it fully supports such an approach. In short: an application designed for speed must not rely on some kind of general-purpose function but on uncompromising/tuned ones.
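A sketch of such a cascade as I understand the description (26 first-letter slots x 31 length slots x 8192 hash slots; Hash13 is a placeholder for any 13-bit multiplicative hasher, not a real function):

UINT Hash13(const CHAR *word, SIZE_T len);  /* placeholder: any 13-bit hasher */

UINT CascadeIndex(const CHAR *word, SIZE_T len) {
    UINT level1 = (UINT)(word[0] - 'a');     /* 0..25: first letter (lowercase assumed) */
    UINT level2 = (UINT)((len - 1) % 31);    /* 0..30: string length */
    UINT level3 = Hash13(word, len) & 8191;  /* 13 bits from the hasher */
    return (level1 * 31 + level2) * 8192 + level3;
}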
Talking of 'FNV' variants (very well defined at Mr. Noll's site -- the home of FNV), I think there is a niche where even 128-bit (with greedy 64-bit multiplications used similarly to the 32/64-bit counterpart, in order to gain speed as in 'FNV1A_Hash_8_OCTETS') would be interesting to explore/test.
As you can see, they are very different from the results on Core i5. So the reason is differences between microarchitectures.
I'm an engineer. Here's how we solve such things: everything that's inside a 10% span can be considered the same! We should first drop the functions which we know are obviously bad, even when they are good in some special occasions. That means Novak, Fletcher, FNV1A_unrolled, and Weinberger must go, but also Hanson (interesting: the nature of the input set of Wikipedia words only is quite forgiving to the Hanson function, which we know loses bits in every iteration), and Hsieh and x17 according to the collisions graph. Then, from the remaining ones, we keep those that are inside 10% starting from the first remaining one and consider them *the same*; then those inside the next 10%, consider these the same but belonging to the lower group; the rest we drop. We repeat this in all 4 cases. Which functions are then always among the first 10%? Which are sometimes in the top 10% and sometimes in another 10%? We need to know only the 4 resulting groups. I proposed the "initial drop" group. Now let's make the "always in top 10%", "less good", and "even less good" groups. Which function is where?
(If this approach works, we can then generate the tests on a few more platforms and use those scores too. Again, the winner is any function that is still in "always top 10%" (if there is such a one; if not, then those in "less good", etc.). Processors such as MIPS and ARM are certainly more different than all the x86s. I can run something on my MIPS-based router, but only if it doesn't use more than around 10 MB of RAM and can be compiled with GCC.)
"Looking at the chart I see no difference for all functions up to 22bits, which is something new for me ...",
I was thinking of my own approach and talking of something else/obvious; from time to time I make such blunders.
Of course, when 4,194,304 slots are given with 12,561,874 incoming keys, the resulting number of collisions lies in (12,561,874-4,194,304)..(12,561,874-1).
What was on my mind: the slot[s] out of 6,602,752 with the MAXIMUM number of collisions. My explanation for such thinking failures: a mind troubled by anxiety.
Again, the thing I am/was interested in: to examine/find the slot[s] with MAXIMUM collisions (in fact, the depth) for each bucket (hash table) -- this is crucial for any further processing, whether linked lists, binary search trees, B-trees, or another hash.
In my view, the useful info (even when keys <= slots) is not the hash table utilization but the depth (how much it is nested). For example, if 2^20 slots are used with 2^24 keys, I care about the peak: the ideal case is 2^4 depth, or 16 layers of one megabyte each. If 2^13 slots are used with 2^12 keys, I care not so much about zero collisions as about there not being a slot with depth > some_THRESHOLD; of course, the primary goal is to hash the keys at the highest possible speed.
Peter, if you find it useful (for the sake of gaining more experience), please include this one too (as 32-bit and/or 64-bit):
And also, for research purposes, the eight-bytes-at-once FNV variant:
Enough, I am stopping with offering more functions.

Now I'd drop those that scored 4 in the "short" tests (more than 100% worse than the best for the given test): Alfalfa, Alfalfa_DWORD, Partow, One_At_Time.
Second, I can drop the only ones that scored 3 in the i5big test: MaPrime2c and Ramakrishna (they are obviously the worst in the pmsmall test too).
Then I can drop the only one which never scores 1 in a small test: lookup3; it's good, since it always scores 2 in i5big and pmbig. That leaves:
Now, using "big" results it's obvious that the winners are Larson, Murmur2 and x65599 (the only three that scored 1 by both big).
Using "small" results, we can't see much. We see that SBox and CRC-32 never score 3 on PM (meaning that the table lookups are really fast there and that the resulting behaviour is good). Otherwise, every function sometimes scores 3. That again can only mean that these measurements were "too fuzzy" to be of too much use. And that leaves open if using short tests is relevant enough, in the form they are now, and if we were fair to drop some function that scored 4 there and were 1 in bigs (Alfalfa, Alfalfa_DWORD).
If the small tests are relevant enough to mean something, and if we believe the bigs more, Murmur2 is the best, the second is Larson, the third x65599.
However, if we look at how complex the functions are to implement, x65599 is the simplest, requiring no fast MUL in the CPU (65599 = 2^16 + 2^6 - 1, so the multiplication reduces to two shifts and a subtraction); the second (as it doesn't require alignment) is Larson; the third is Murmur2, which has the highest demands on the hardware.
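A minimal sketch of that shift form (the well-known sdbm-style formulation; the benchmark's exact code may differ):

UINT HashX65599(const CHAR *key, SIZE_T len) {
    UINT hash = 0;
    for (SIZE_T i = 0; i < len; ++i)
        /* hash * 65599 + key[i], with 65599 = 2^16 + 2^6 - 1 */
        hash = (hash << 16) + (hash << 6) - hash + key[i];
    return hash;
}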
The conclusion:
1) It seems that x65599 is confirmed to be both very good and simple. Larson is the second simplest function. Murmur2 can be better as soon as you have aligned strings and use a modern CPU, or if you have some demands not covered by all the above tests.
2) The "small" tests in the current form still look "too fuzzy" for my taste to trust them more than as to point "what's not good" (if even so much). I don't know how they can be improved, instead, I'd rather modify or add inputs to "big" tests in order to have more values that we saw that are able to detect the problems in functions tested in "small" (prefix, suffix, numbers, binary numbers).
Whichever of FNV-1a, Larson, Murmur2, x65599, Alfalfa_HALF, Bernstein, CRC-32, K&R, SBox, or Sedgewick you use, the worst that can happen in the tested cases is a slowdown of 2x compared to the best tested outcome; most of the time you can expect to be within a 20% difference. So if you use one of the above functions in scenarios close to the tested ones, you practically don't have to worry whether you selected the right one. A factor-2 difference is not much compared to the much bigger differences (i.e., factors of 100) that can be caused just by allocating too often, for example.
So, in my opinion, as long as the practical "speed competition" can't be organized to consistently demonstrate bigger, obvious differences, it seems that we shouldn't worry too much about speed.
The damage is undone, the ugly inconsistency (Mobile vs. i5) exterminated; the result for 25 bits (hash.cpp is a very useful testbed, Peter, thank you) on a Pentium Merom is:
Peter, please update the above functions (if you find them useful); they are now finished.
I agree that the difference between, say, SBox and CRC-32 (in software) is negligible, but try comparing Weinberger and Murmur2: a factor-2 difference is noticeable for an end user. Also remember that I excluded some really bad functions from the test (e.g., Two chars). So the choice of hash function matters, and we should explore different architectures. I fully agree with you about the best function, which is Murmur2; Larson and x65599 are nice, too.
Georgi, the updated FNV1A_unrolled is much better on the Wikipedia test, congratulations! Regarding the slots with maximum collisions: the Red Dragon Book proposes the following formula for evaluating hash function quality:
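The formula (reconstructed here from the description that follows):

    sum for j = 0..m-1 of b_j (b_j + 1) / 2
    ---------------------------------------
           (n / 2m) (n + 2m - 1)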
where b_j is the number of items in the j-th slot, m is the number of slots, and n is the total number of items. The sum of b_j (b_j + 1) / 2 estimates the number of slots your program should visit to find the required value (I used a counter for the number of visited slots in an earlier version of the benchmark, which is similar). The denominator (n / 2m)(n + 2m - 1) is the number of visited slots for an ideal function that puts each item into a random slot. So, if the function is ideal, the formula should give 1. In reality, a good function is somewhere between 0.95 and 1.05. If it's more, there is a high number of collisions (slow!). If it's less, the function gives fewer collisions than the randomly distributing function, so AFAIK it's not bad.
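As a sketch, the same metric transcribed into C (my illustration, not the benchmark's code; counts[j] holds b_j):

/* Red Dragon quality metric: ~1.0 is ideal; above ~1.05 means too many
   collisions. counts[j] = items in slot j, m slots, n items total. */
double HashQuality(const unsigned *counts, unsigned m, unsigned n) {
    double visited = 0.0;
    for (unsigned j = 0; j < m; ++j)
        visited += counts[j] * (counts[j] + 1.0) / 2.0;
    double ideal = ((double)n / (2.0 * m)) * (n + 2.0 * m - 1.0);
    return visited / ideal;
}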
Here are the results for some of our functions:
Murmur2 is the best, again. I wrote a Python 3.x script (count_quality.py) for generating such charts.
I also like how nicely visible it is that numbers, prefix, and postfix are the best stress tests for a lot of functions.
Regarding "factor 2": it's the fact that your "small" measurements can't point to a single function which stays *always* on top. That's why I rejected those that certainly make more than factor 2 (that score 4) or exhibit problems. The remaining ones make less (score 3 means: 20%-100%) and even that not often -- note that in your PM big test all functions:
Alfalfa, Alfalfa_DWORD, Alfalfa_HALF, Bernstein, CRC-32, K&R, Larson, Murmur2, Partow, Ramakrishna, Sedgewick, x65599, FNV-1a, MaPrime2c, One_At_Time, SBox, lookup3
differ by no more than 20% from the best one, and most by no more than 10%! And that's for big inputs. For small inputs, there's practically no chance that you'll notice the difference. Even for big inputs, it's not worth the invested time to consider them: 3 seconds or 2.6 seconds -- who sees that? Most programmers introduce obvious overheads that are orders of magnitude bigger than that.
Therefore I like any direction that would either demonstrate more obvious time differences or provide some other relevant criterion for evaluation.
Regarding the i5 measurements, I suggest (as I always did): don't use a micrometer to measure the distance between cities. Produce tests that take enough time to be measured with an independent clock (QueryPerf..) by simply making the test run long enough. Don't measure too-small inputs -- they deceive; or if you do need to measure small inputs, use a lot of different small inputs (like always using a new part of the Wikipedia words, new number values, etc.). Then, if it's still needed, add an explicit overhead to some operation to make the meaningful differences more obvious. For example, make the cost of fetching the values further along the same chain higher (before they are compared with the input value). That's reasonable to expect as soon as your hash table and the values in it are much bigger than the CPU cache. See what you get then... Then add allocations -- see if that makes all other differences irrelevant (nice to know), etc.
If you add allocations or some other "overhead", the result will be completely unpredictable, because you will be measuring not only your hash function but also the allocator performance. I'm moving in another direction: trying to eliminate all side effects on Core i5 (on Pentium M, they are already eliminated).
Very large in-memory hash tables are rarely used in real-world programs. Large databases are typically implemented with B-trees, but I'm interested in small hash tables such as a symbol table in a compiler.
Do we see a graph for all n? I thought it was for only one n. It would be interesting to see how the results vary as n changes.
> m = 2 × round_to_next_power_of_two(n)
if you target the "real life" scenarios, how do you expect your compiler to know n in advance to determine m? AFAIK compilers typically have fixed m or rehash as little as possible and then you can vary only n. If m is not fixed then you have to add "rehashing" too to your measurements.
And what kind of compiler do you target? Modern C++ compilers now see very long (decorated) identifiers and an awful lot of values, thanks to libraries and approaches like Boost. All the Win32 and framework headers were also quite big before, but orders of magnitude smaller. It makes sense to measure with a lot of values. Long ones, too.
> trying to eliminate all side effects on Core i5
I'd just make the test last at least a few seconds of calculation for each measurement -- if I were measuring small tables, I'd feed them with various inputs (now, with the big set of words, there's enough to use new ones for every pass!) and measure the total time. I'm quite sure that gives much more repeatable results, and it can show the differences better too. It's simple: if you use too little to measure, you measure more noise than signal. And of course, I wouldn't measure ticks but the "real" time between the start and the end of every such calculation.
Ace,
- 'HashFNV1A_unrolled' is a pure FNV-1A variant;
- they are different -- don't you see the XOR? Generally, I don't read enough; that is, I like to experience things myself, so a pseudo-plagiarism is possible.
HashKernighanRitchie:
hash = 31 * hash + key[i];
HashFNV1A_unrolled:
hash = ((hash ^ *(DWORD*)str)<<5) - (hash ^ *(DWORD*)str);
And whether 15, 17, 31, or 33 -- the choice is a kind of gambling.
The first improved function is 'FNV1A_Hash_WHIZ': FNV1A_Hash_4_OCTETS with FNV1_32_PRIME = 1607, or a nearby prime such as 1579.
The second improved function is 'FNV1A_unrolled_Final': the post/second cycle in 'FNV1A_unrolled', just unrolled.
'FNV1A_Hash_WHIZ' follows:
#define FNV1_32_INIT ((UINT)2166136261)
#define FNV1_32_PRIME ((UINT)1607)
#define FNV_32A_OP(hash, octet) \
(((UINT)(hash) ^ (unsigned char)(octet)) * FNV1_32_PRIME)
#define FNV_32A_OP32(hash, octet) \
(((UINT)(hash) ^ (UINT)(octet)) * FNV1_32_PRIME)
UINT FNV1A_Hash_WHIZ(const char *str, SIZE_T wrdlen)
{
    UINT hash32 = FNV1_32_INIT;
    const char *p = str;
    // Process four bytes at a time
    for(; wrdlen >= 4; wrdlen -= 4, p += 4) {
        hash32 = FNV_32A_OP32(hash32, *(UINT *)p);
    }
    if (wrdlen & -2) {  // wrdlen is 0..3 here, so this tests wrdlen >= 2
        hash32 = FNV_32A_OP32(hash32, *(UINT*)p & 0xFFFF);
        p += 2;
    }
    if (wrdlen & 1)     // final odd byte
        hash32 = FNV_32A_OP(hash32, *p);
    return hash32 ^ (hash32 >> 16);  // fold the high bits into the low bits
}
'FNV1A_unrolled_Final' follows:
UINT HashFNV1A_unrolled_Final(const CHAR *str, SIZE_T wrdlen)
{
//const UINT PRIME = 31;
UINT hash = 2166136261;
const CHAR * p = str;
/*
// Reduce the number of multiplications by unrolling the loop
for (SIZE_T ndwords = wrdlen / sizeof(DWORD); ndwords; --ndwords) {
//hash = (hash ^ *(DWORD*)p) * PRIME;
hash = ((hash ^ *(DWORD*)p)<<5) - (hash ^ *(DWORD*)p);
p += sizeof(DWORD);
}
*/
for(; wrdlen >= 4; wrdlen -= 4, p += 4) {
hash = ((hash ^ *(DWORD*)p)<<5) - (hash ^ *(DWORD*)p);
}
// Process the remaining bytes
/*
for (SIZE_T i = 0; i < (wrdlen & (sizeof(DWORD) - 1)); i++) {
//hash = (hash ^ *p++) * PRIME;
hash = ((hash ^ *p)<<5) - (hash ^ *p);
p++;
}
*/
if (wrdlen & -2) {
hash = ((hash ^ (*(DWORD*)p&0xFFFF))<<5) - (hash ^ (*(DWORD*)p&0xFFFF));
p++;p++;
}
if (wrdlen & 1)
hash = ((hash ^ *p)<<5) - (hash ^ *p);
return (hash>>16) ^ hash;
}
I've also made a new FNV-1A derivative hasher targeting case-insensitive Latin letters; perhaps it is not useful for anything else, but the idea of making ONE multiplication/(shift+subtraction) per 6-char range is worth thinking about, do you agree?
At first the name was 'Sixtine', a la S. King's 'Christine' monster, but I also wanted the name to hint at the case-insensitiveness.
// Tuned for lowercase-and-uppercase letters i.e. 26 ASCII symbols 65-90 and 97-122 decimal.
UINT Sixtinsensitive(const CHAR *str, SIZE_T wrdlen)
{
UINT hash = 2166136261;
UINT hashBUFFER_EAX, hashBUFFER_BH, hashBUFFER_BL;
const CHAR * p = str;
// 0x41 = 065 'A' 010 [0 0001]
// 0x5A = 090 'Z' 010 [1 1010]
// 0x61 = 097 'a' 011 [0 0001]
// 0x7A = 122 'z' 011 [1 1010]
// Reduce the number of multiplications by unrolling the loop
for(; wrdlen >= 6; wrdlen -= 6, p += 6) {
//hashBUFFER_AX = (*(DWORD*)(p+0)&0xFFFF);
hashBUFFER_EAX = (*(DWORD*)(p+0)&0x1F1F1F1F);
hashBUFFER_BL = (*(p+4)&0x1F);
hashBUFFER_BH = (*(p+5)&0x1F);
//6bytes-in-4bytes or 48bits-to-30bits
// Two times next:
//3bytes-in-2bytes or 24bits-to-15bits
//EAX BL BH
//[5bit][3bit][5bit][3bit][5bit][3bit][5bit][3bit]
// 5th[0..15] 13th[0..15]
// BL lower 3 BL higher 2bits
// OR or XOR no difference
hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BL&0x07)<<5); // BL lower 3bits of 5bits
hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BL&0x18)<<(2+8)); // BL higher 2bits of 5bits
hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BH&0x07)<<(5+16)); // BH lower 3bits of 5bits
hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BH&0x18)<<((2+8)+16)); // BH higher 2bits of 5bits
//hash = (hash ^ hashBUFFER_EAX)*1607; //What a mess: <<7 becomes imul but <<5 not!?
hash = ((hash ^ hashBUFFER_EAX)<<5) - (hash ^ hashBUFFER_EAX);
//1607:[2118599]
// 127:[2121081]
// 31:[2139242]
// 17:[2150803]
// 7:[2166336]
// 5:[2183044]
//8191:[2200477]
// 3:[2205095]
// 257:[2206188]
}
// Post-Variant #1:
for(; wrdlen; wrdlen--, p++) {
hash = ((hash ^ (*p&0x1F))<<5) - (hash ^ (*p&0x1F));
}
/*
// Post-Variant #2:
for(; wrdlen >= 2; wrdlen -= 2, p += 2) {
hash = ((hash ^ (*(DWORD*)p&0xFFFF))<<5) - (hash ^ (*(DWORD*)p&0xFFFF));
}
if (wrdlen & 1)
hash = ((hash ^ *p)<<5) - (hash ^ *p);
*/
/*
// Post-Variant #3:
for(; wrdlen >= 4; wrdlen -= 4, p += 4) {
hash = ((hash ^ *(DWORD*)p)<<5) - (hash ^ *(DWORD*)p);
}
if (wrdlen & -2) {
hash = ((hash ^ (*(DWORD*)p&0xFFFF))<<5) - (hash ^ (*(DWORD*)p&0xFFFF));
p++;p++;
}
if (wrdlen & 1)
hash = ((hash ^ *p)<<5) - (hash ^ *p);
*/
return (hash>>16) ^ hash;
}
?Sixtinsensitive@@YAIPBDK@Z PROC ; Sixtinsensitive
; 891 : {
00300 53 push ebx
00301 55 push ebp
; 892 : UINT hash = 2166136261;
; 893 : UINT hashBUFFER_EAX, hashBUFFER_BH, hashBUFFER_BL;
; 894 : const CHAR * p = str;
; 895 :
; 896 : // Ox41 = 065 'A' 010 [0 0001]
; 897 : // Ox5A = 090 'Z' 010 [1 1010]
; 898 : // Ox61 = 097 'a' 011 [0 0001]
; 899 : // Ox7A = 122 'z' 011 [1 1010]
; 900 :
; 901 : // Reduce the number of multiplications by unrolling the loop
; 902 : for(; wrdlen >= 6; wrdlen -= 6, p += 6) {
00302 8b 6c 24 10 mov ebp, DWORD PTR _wrdlen$[esp+4]
00306 57 push edi
00307 8b 7c 24 10 mov edi, DWORD PTR _str$[esp+8]
0030b bb c5 9d 1c 81 mov ebx, -2128831035 ; 811c9dc5H
00310 83 fd 06 cmp ebp, 6
00313 72 5c jb SHORT $LN4@Sixtinsens
00315 b8 ab aa aa aa mov eax, -1431655765 ; aaaaaaabH
0031a f7 e5 mul ebp
0031c c1 ea 02 shr edx, 2
0031f 56 push esi
$LL6@Sixtinsens:
; 903 : //hashBUFFER_AX = (*(DWORD*)(p+0)&0xFFFF);
; 904 : hashBUFFER_EAX = (*(DWORD*)(p+0)&0x1F1F1F1F);
; 905 : hashBUFFER_BL = (*(p+4)&0x1F);
; 906 : hashBUFFER_BH = (*(p+5)&0x1F);
00320 0f be 77 05 movsx esi, BYTE PTR [edi+5]
00324 0f be 4f 04 movsx ecx, BYTE PTR [edi+4]
00328 83 e6 1f and esi, 31 ; 0000001fH
; 907 : //6bytes-in-4bytes or 48bits-to-30bits
; 908 : // Two times next:
; 909 : //3bytes-in-2bytes or 24bits-to-15bits
; 910 : //EAX BL BH
; 911 : //[5bit][3bit][5bit][3bit][5bit][3bit][5bit][3bit]
; 912 : // 5th[0..15] 13th[0..15]
; 913 : // BL lower 3 BL higher 2bits
; 914 : // OR or XOR no difference
; 915 : hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BL&0x07)<<5); // BL lower 3bits of 5bits
; 916 : hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BL&0x18)<<(2+8)); // BL higher 2bits of 5bits
; 917 : hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BH&0x07)<<(5+16)); // BH lower 3bits of 5bits
; 918 : hashBUFFER_EAX = hashBUFFER_EAX ^ ((hashBUFFER_BH&0x18)<<((2+8)+16)); // BH higher 2bits of 5bits
; 919 : //hash = (hash ^ hashBUFFER_EAX)*1607; //What a mess: <<7 becomes imul but <<5 not!?
; 920 : hash = ((hash ^ hashBUFFER_EAX)<<5) - (hash ^ hashBUFFER_EAX);
0032b 8b c6 mov eax, esi
0032d 83 e0 18 and eax, 24 ; 00000018H
00330 c1 e0 05 shl eax, 5
00333 83 e1 1f and ecx, 31 ; 0000001fH
00336 83 e6 07 and esi, 7
00339 33 c6 xor eax, esi
0033b 8b f1 mov esi, ecx
0033d c1 e0 0b shl eax, 11 ; 0000000bH
00340 83 e1 07 and ecx, 7
00343 83 e6 18 and esi, 24 ; 00000018H
00346 33 c6 xor eax, esi
00348 c1 e0 05 shl eax, 5
0034b 33 c1 xor eax, ecx
0034d 8b 0f mov ecx, DWORD PTR [edi]
0034f 81 e1 1f 1f 1f
1f and ecx, 522133279 ; 1f1f1f1fH
00355 c1 e0 05 shl eax, 5
00358 33 c1 xor eax, ecx
0035a 33 c3 xor eax, ebx
0035c 8b c8 mov ecx, eax
0035e c1 e1 05 shl ecx, 5
00361 2b c8 sub ecx, eax
00363 83 ed 06 sub ebp, 6
00366 83 c7 06 add edi, 6
00369 83 ea 01 sub edx, 1
0036c 8b d9 mov ebx, ecx
0036e 75 b0 jne SHORT $LL6@Sixtinsens
00370 5e pop esi
$LN4@Sixtinsens:
; 921 : //1607:[2118599]
; 922 : // 127:[2121081]
; 923 : // 31:[2139242]
; 924 : // 17:[2150803]
; 925 : // 7:[2166336]
; 926 : // 5:[2183044]
; 927 : //8191:[2200477]
; 928 : // 3:[2205095]
; 929 : // 257:[2206188]
; 930 : }
; 931 : // Post-Variant #1:
; 932 : for(; wrdlen; wrdlen--, p++) {
00371 85 ed test ebp, ebp
00373 74 17 je SHORT $LN1@Sixtinsens
$LL3@Sixtinsens:
; 933 : hash = ((hash ^ (*p&0x1F))<<5) - (hash ^ (*p&0x1F));
00375 0f be 07 movsx eax, BYTE PTR [edi]
00378 83 e0 1f and eax, 31 ; 0000001fH
0037b 33 c3 xor eax, ebx
0037d 8b d0 mov edx, eax
0037f c1 e2 05 shl edx, 5
00382 2b d0 sub edx, eax
00384 4d dec ebp
00385 47 inc edi
00386 8b da mov ebx, edx
00388 85 ed test ebp, ebp
0038a 75 e9 jne SHORT $LL3@Sixtinsens
$LN1@Sixtinsens:
; 934 : }
; 935 : /*
; 936 : // Post-Variant #2:
; 937 : for(; wrdlen >= 2; wrdlen -= 2, p += 2) {
; 938 : hash = ((hash ^ (*(DWORD*)p&0xFFFF))<<5) - (hash ^ (*(DWORD*)p&0xFFFF));
; 939 : }
; 940 : if (wrdlen & 1)
; 941 : hash = ((hash ^ *p)<<5) - (hash ^ *p);
; 942 : */
; 943 : /*
; 944 : // Post-Variant #3:
; 945 : for(; wrdlen >= 4; wrdlen -= 4, p += 4) {
; 946 : hash = ((hash ^ *(DWORD*)p)<<5) - (hash ^ *(DWORD*)p);
; 947 : }
; 948 : if (wrdlen & -2) {
; 949 : hash = ((hash ^ (*(DWORD*)p&0xFFFF))<<5) - (hash ^ (*(DWORD*)p&0xFFFF));
; 950 : p++;p++;
; 951 : }
; 952 : if (wrdlen & 1)
; 953 : hash = ((hash ^ *p)<<5) - (hash ^ *p);
; 954 : */
; 955 : return (hash>>16) ^ hash;
0038c 8b c3 mov eax, ebx
0038e 5f pop edi
0038f c1 e8 10 shr eax, 16 ; 00000010H
00392 5d pop ebp
00393 33 c3 xor eax, ebx
00395 5b pop ebx
; 956 : }
00396 c3 ret 0
?Sixtinsensitive@@YAIPBDK@Z ENDP ; Sixtinsensitive
Still, I am not happy with 'Sixtinsensitive''s speed, which is about (12445829 - 11075601) / 11075601 * 100% ≈ 12.4% slower than 'FNV1A_unrolled_Final'. If you are interested in boosting such an approach, I will be glad.
D:\_KAZE_new-stuff\_w>hash.exe wikipedia-en-html.tar.wrd
12561874 lines read
33554432 elements in the table (25 bits)
Sixtinsensitive: 12554437 12465228 12466856 12445829 12447431| 12445829 [2139242]
x17 unrolled: 11927763 11908183 11928229 11923448 11929265| 11908183 [2410605]
FNV1A_Hash_WHIZ: 10642905 10647391 10631697 10633363 10626230| 10626230 [2189360] // FNV1_32_PRIME = 1607
FNV1A_Hash_WHIZ: 10567474 10566509 10587741 10604073 10563141| 10563141 [2144749] // FNV1_32_PRIME = 1999
FNV1A_Hash_WHIZ: 10535248 10555186 10514388 10517062 10518655| 10514388 [2154569] // FNV1_32_PRIME = 1579
FNV1A_unrolled_Final: 11101280 11084731 11105538 11075601 11085023| 11075601 [2252381]
FNV1A_unrolled: 11212976 11226149 11236817 11188433 11262786| 11188433 [2191287]
Alfalfa: 11837124 11864087 11835669 11856730 11841964| 11835669 [2074883]
Alfalfa_HALF: 11978432 11984171 11955814 11966588 11961493| 11955814 [2077426]
D:\_KAZE_new-stuff\_w>hash_benchmark.bat
Words
500 lines read
1024 elements in the table (10 bits)
Sixtinsensitive: 99 91 90 90 89| 89 [ 105]
x17 unrolled: 97 92 91 118 91| 91 [ 109]
FNV1A_Hash_WHIZ: 90 84 82 82 81| 81 [ 124]
FNV1A_unrolled_Final: 87 80 79 78 78| 78 [ 115]
FNV1A_unrolled: 90 83 81 80 80| 80 [ 106]
Alfalfa: 94 88 87 86 87| 86 [ 100]
Alfalfa_HALF: 98 111 91 91 90| 90 [ 97]
Win32
1992 lines read
4096 elements in the table (12 bits)
Sixtinsensitive: 524 512 534 510 509| 509 [ 409]
x17 unrolled: 592 617 588 588 589| 588 [ 414]
FNV1A_Hash_WHIZ: 464 437 436 437 456| 436 [ 418]
FNV1A_unrolled_Final: 434 425 424 423 449| 423 [ 414]
FNV1A_unrolled: 433 426 425 424 444| 424 [ 404]
Alfalfa: 573 569 568 587 568| 568 [ 411]
Alfalfa_HALF: 574 587 566 566 565| 565 [ 428]
Numbers
500 lines read
1024 elements in the table (10 bits)
Sixtinsensitive: 66 65 65 65 66| 65 [ 206]
x17 unrolled: 48 48 48 48 48| 48 [ 24]
FNV1A_Hash_WHIZ: 52 51 50 50 50| 50 [ 304]
FNV1A_unrolled_Final: 69 69 69 68 69| 68 [ 420]
FNV1A_unrolled: 70 69 68 68 68| 68 [ 420]
Alfalfa: 45 45 45 45 45| 45 [ 160]
Alfalfa_HALF: 54 53 53 53 53| 53 [ 288]
Prefix
500 lines read
1024 elements in the table (10 bits)
Sixtinsensitive: 165 162 162 162 162| 162 [ 106]
x17 unrolled: 205 204 204 204 204| 204 [ 113]
FNV1A_Hash_WHIZ: 129 146 128 128 128| 128 [ 100]
FNV1A_unrolled_Final: 123 122 122 122 122| 122 [ 101]
FNV1A_unrolled: 125 124 124 123 123| 123 [ 116]
Alfalfa: 199 219 200 200 200| 199 [ 104]
Alfalfa_HALF: 200 199 199 199 199| 199 [ 115]
Postfix
500 lines read
1024 elements in the table (10 bits)
Sixtinsensitive: 164 161 160 160 160| 160 [ 116]
x17 unrolled: 201 200 200 200 200| 200 [ 102]
FNV1A_Hash_WHIZ: 127 126 125 126 125| 125 [ 109]
FNV1A_unrolled_Final: 122 120 120 120 120| 120 [ 110]
FNV1A_unrolled: 120 119 263 118 118| 118 [ 102]
Alfalfa: 198 198 197 197 197| 197 [ 116]
Alfalfa_HALF: 198 197 196 302 196| 196 [ 111]
Variables
1842 lines read
4096 elements in the table (12 bits)
Sixtinsensitive: 433 416 415 415 436| 415 [ 374]
x17 unrolled: 438 434 433 451 433| 433 [ 368]
FNV1A_Hash_WHIZ: 369 361 358 360 383| 358 [ 353]
FNV1A_unrolled_Final: 363 354 353 352 353| 352 [ 352]
FNV1A_unrolled: 367 360 359 357 358| 357 [ 341]
Alfalfa: 440 415 413 413 413| 413 [ 343]
Alfalfa_HALF: 462 438 437 435 436| 435 [ 396]
Sonnets
3228 lines read
8192 elements in the table (13 bits)
Sixtinsensitive: 611 587 587 586 586| 586 [ 542]
x17 unrolled: 618 586 585 585 585| 585 [ 589]
FNV1A_Hash_WHIZ: 523 514 513 514 512| 512 [ 555]
FNV1A_unrolled_Final: 527 519 517 516 517| 516 [ 582]
FNV1A_unrolled: 531 525 522 523 523| 522 [ 570]
Alfalfa: 562 555 555 555 554| 554 [ 570]
Alfalfa_HALF: 589 579 579 577 578| 577 [ 543]
UTF-8
13408 lines read
32768 elements in the table (15 bits)
Sixtinsensitive: 2885 2813 2789 2793 2807| 2789 [ 2414]
x17 unrolled: 2910 2910 2889 2896 2915| 2889 [ 2392]
FNV1A_Hash_WHIZ: 2493 2492 2501 2473 2476| 2473 [ 2403]
FNV1A_unrolled_Final: 2479 2474 2494 2471 2483| 2471 [ 2446]
FNV1A_unrolled: 2508 2520 2547 2693 2470| 2470 [ 2421]
Alfalfa: 2735 2726 2725 2743 2761| 2725 [ 2415]
Alfalfa_HALF: 2927 2892 2900 2943 2933| 2892 [ 2445]
IPv4
3925 lines read
8192 elements in the table (13 bits)
Sixtinsensitive: 543 526 526 525 526| 525 [ 1443]
x17 unrolled: 444 444 443 444 443| 443 [ 829]
FNV1A_Hash_WHIZ: 397 396 394 394 395| 394 [ 1404]
FNV1A_unrolled_Final: 394 395 395 394 394| 394 [ 1419]
FNV1A_unrolled: 400 403 404 402 401| 400 [ 1419]
Alfalfa: 405 419 405 403 403| 403 [ 728]
Alfalfa_HALF: 434 433 432 432 436| 432 [ 813]
D:\_KAZE_new-stuff\_w>
The next 4 keys (words/phrases/sentences/paragraphs) all occupy one slot, and the 'Sixtinsensitive' speed advantage grows as the key length increases:
pneumonoultramicroscopicsilicovolcanoconiosis: "A facetious word alleged to mean 'a lung disease caused by the inhalation of very fine silica dust' but occurring chiefly as an instance of a very long word." [OED] Online Etymology Dictionary, (c) 2010 Douglas Harper
Pneumonoultramicroscopicsilicovolcanoconiosis: "A Facetious Word Alleged To Mean 'a Lung Disease Caused By The Inhalation Of Very Fine Silica Dust' But Occurring Chiefly As An Instance Of A Very Long Word." [Oed] Online Etymology Dictionary, (C) 2010 Douglas Harper
PNEUMONOULTRAMICROSCOPICSILICOVOLCANOCONIOSIS: "A FACETIOUS WORD ALLEGED TO MEAN 'A LUNG DISEASE CAUSED BY THE INHALATION OF VERY FINE SILICA DUST' BUT OCCURRING CHIEFLY AS AN INSTANCE OF A VERY LONG WORD." [OED] ONLINE ETYMOLOGY DICTIONARY, (C) 2010 DOUGLAS HARPER
pneumonoultramicroscopicsilicovolcanoconiosis: "a facetious word alleged to mean 'a lung disease caused by the inhalation of very fine silica dust' but occurring chiefly as an instance of a very long word." [oed] online etymology dictionary, (c) 2010 douglas harper
D:\_KAZE_new-stuff\_w>hash.exe 4keys.txt
4 lines read
8 elements in the table (3 bits)
Sixtinsensitive: 7 6 6 6 6| 6 [ 3]
x17 unrolled: 10 10 10 10 10| 10 [ 0]
FNV1A_Hash_WHIZ: 5 5 5 5 5| 5 [ 1]
FNV1A_unrolled_Final: 4 4 4 4 4| 4 [ 0]
FNV1A_unrolled: 4 4 4 4 4| 4 [ 0]
Alfalfa: 10 10 10 10 10| 10 [ 0]
Alfalfa_HALF: 10 10 10 10 10| 10 [ 1]
D:\_KAZE_new-stuff\_w>
Once again, it would bring joy if 'Sixtinsensitive' could outperform 'FNV1A_unrolled'. Personally, I don't get the above result; my expectation was for 'Sixtinsensitive' to whiz with such long keys. Obviously some things must be improved.
Georgi, here are the results on Pentium M:
This code:
should be:
With *(DWORD*), a buffer overrun is possible at the end of a memory page. Also note wrdlen & 2 instead of wrdlen & -2.
Generally, Whiz is one of the fastest hash functions at the moment :) It proves that a simpler hash function can be faster despite a higher number of collisions. Thank you very much for your contribution!
Dummy me, you are right: 'if (wrdlen & -2) {' becomes 'if (wrdlen & 2) {'.
The idea was to cover the four cases 0,1,2,3 with two IFs.
As for 'With *(DWORD*), a buffer overrun is possible at the end of a memory page': I knew about it, but was fooled by the assembly code generated by VS2010, which translates it to a word access:
; 792 : hash32 = FNV_32A_OP32(hash32, *(UINT*)p&0xFFFF);
00360 0f b7 30 movzx esi, WORD PTR [eax]
00363 33 f1 xor esi, ecx
00365 69 f6 47 06 00
00 imul esi, 1607 ; 00000647H
0036b 8b ce mov ecx, esi
In fact I still don't know how to operate directly on words (two bytes; unsigned short int may be 32-bit instead of 16-bit, yes?) in C. I was eager to share the functions; glad that you fixed the bugs.
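For the record: a portable way to read a 16-bit word in C, with no assumptions about alignment or the width of short, is a fixed-width type plus memcpy (a minimal sketch; compilers turn it into a single load):

#include <stdint.h>
#include <string.h>

/* Reads two bytes in native byte order from a possibly unaligned
   position, without touching memory past p[1]. */
static uint16_t load16(const char *p)
{
    uint16_t w;
    memcpy(&w, p, sizeof w);
    return w;
}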
But one mystery (caused by alignment!?) remains: why do the Wikipedia results on my Pentium Merom show a very different roster?
I've just run the Wikipedia test on Core i5. Whiz is very fast.
Unfortunately, it's slower on Pentium M. I repeated the test twice on both processors, and the results don't change. Most likely, there is some unlucky difference between microarchitectures, but we should optimize for the newer processors (your Merom and my Lynnfield).
I have high hopes (due to cheap lowercasing and 4+2 granularity) for 'Sixtinsensitive', already incorporated in Leprechaun with very consistent distribution regardless of FNV1_32_PRIME = 3,5,7,17,31,127,257,8191.
Isn't the reason for your observations the fact that your source strings simply aren't aligned anymore? That's why Murmur and anything that uses "dwords" should be penalized on all architectures that are affected by alignment. I believe if you tried with MIPS or ARM you'd also see the problems of unaligned access. But as you also see now, it's very convenient to have strings not starting on aligned address (you can omit a lot of allocations, copying and also reduce the memory needs).
But you also know my general opinion: small differences in speed (under 10 or 20%, and with changes across the platform) aren't too relevant, there's got to be some more dramatic demonstration of superiority of some function to some other one in order to claim a really "relevant" winner, and the winner should be "overall good" not only "good on the CPU I have at the moment". I'd always prefer to use the function that is "near the best" on more platforms to the function "the best on one CPU and bad on others."
Ace, the strings were not aligned in the earlier version of the program. There was an arena allocator: a large char array and a (not aligned) pointer into this char array, which was increased as the lines were read from file.
I agree that the winner should be overall good, having near-the-best speed and a good number of collisions in all tests. At this time, Murmur2 is the best; a hardware-accelerated CRC may be used if the target CPU is known to support it.
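As an illustration of the hardware option, a string hash built on the SSE4.2 CRC32 instruction (CRC-32C, a different polynomial than zlib's CRC-32) might look like this; a sketch only, requiring a CPU with SSE4.2 and, for GCC/Clang, the -msse4.2 flag:

#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stdint.h>
#include <string.h>

static uint32_t crc32c_hash(const char *s, size_t len)
{
    uint32_t h = 0xFFFFFFFFu;
    uint32_t w;
    while (len >= 4) {                 /* 4 bytes per instruction */
        memcpy(&w, s, 4);
        h = _mm_crc32_u32(h, w);
        s += 4; len -= 4;
    }
    while (len--)                      /* byte tail */
        h = _mm_crc32_u8(h, (uint8_t)*s++);
    return h;
}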
http://encode.ru/threads/1155-A-new-match-searching-structure?p=22923#post22923
Ace, here are the results on Wikipedia, Pentium M:
In short, no difference. The small test is more interesting. Here is the aligned version; the unaligned one can be found above:
The functions that read a WORD at a time (Fletcher and Paul Hsieh) become faster when the strings are aligned. lookup3 relies on alignment (see its code), so it's also faster in this version.
Georgi, thanks for publishing your results.
- FNV1A_Whiz, FNV1_32_PRIME=709607
- FNV1A_Smaragd, FNV1_32_PRIME=709607
- FNV1A_Peregrine, FNV1_32_PRIME=709607
- FNV1A_Nefertiti, FNV1_32_PRIME=31 (in fact MULless)
The romanticism in me dictates this renaming: variants with no personality get their own first name (Whiz, Smaragd, Peregrine, Nefertiti) and a family name (FNV1A). Not to mention the clarity when references are to be made.
To obtain the sources (with corresponding 32-bit instructions) of the above four (plus Sixtinsensitive+ and Sixtinsensitive, and the four Alfalfa variants), download the PDF booklet at: http://www.sanmayce.com/Downloads/_Kaze_10-HASHERS.pdf
I expect Peregrine (using one 64-bit memory access), when compiled for 64-bit, to outspeed Nefertiti (the former 'FNV1A_unrolled_Final').
On the other hand, Smaragd (using one 32-bit memory access but with 2 passes) is slower than Nefertiti, but with fewer collisions.
And Whiz is just a lucky prodigy.
I think Peregrine is the most promising hasher, because for longer strings, with granularity 8 (full-fledged when compiled as 64-bit), it will fly for sure.
I want more people to experience the hash-benchmark fun, so here is my newest test package (131,981,551 bytes):
http://www.sanmayce.com/Downloads/_KAZE_hash_test_r1.rar
It looks like this:
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>dir
11/11/2010 07:52 AM 195,935 hash.cod
11/11/2010 07:52 AM 86,528 hash.exe
11/11/2010 07:31 AM <DIR> Peter_source
11/11/2010 07:52 AM 1,748 Runme.bat
11/11/2010 07:52 AM 4,347,243 Sentence-list_00,032,359_English_The_Holy_Bible.txt
11/11/2010 07:52 AM 388,308 Word-list_00,038,936_English_The Oxford Thesaurus, An A-Z Dictionary of Synonyms.wrd
11/11/2010 07:52 AM 1,121,365 Word-list_00,105,982_English_Spell-Check_High-Quality.wrd
11/11/2010 07:52 AM 4,024,146 Word-list_00,351,114_English_Spell-Check_Unknown-Quality.wrd
11/11/2010 07:52 AM 7,000,453 Word-list_00,584,879_Russian_Spell-Check_Unknown-Quality.slv
11/11/2010 07:52 AM 146,973,879 Word-list_12,561,874_wikipedia-en-html.tar.wrd
11/11/2010 07:52 AM 278,013,406 Word-list_22,202,980_wikipedia-de-en-es-fr-it-nl-pt-ro-html.tar.wrd
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>Runme.bat
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>hash "Word-list_00,584,879_Russian_Spell-Check_Unknown-Quality.slv" 1>"Word-list_00,584,879.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+75 "Word-list_00,584,879.txt" 1>"Word-list_00,584,879_SPEED.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+85 "Word-list_00,584,879.txt" 1>"Word-list_00,584,879_COLLISIONS.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>hash "Sentence-list_00,032,359_English_The_Holy_Bible.txt" 1>"Sentence-list_00,032,359.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+75 "Sentence-list_00,032,359.txt" 1>"Sentence-list_00,032,359_SPEED.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+85 "Sentence-list_00,032,359.txt" 1>"Sentence-list_00,032,359_COLLISIONS.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>hash "Word-list_00,038,936_English_The Oxford Thesaurus, An A-Z Dictionary of Synonyms.wrd" 1>"Word-list_00,038,936.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+75 "Word-list_00,038,936.txt" 1>"Word-list_00,038,936_SPEED.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+85 "Word-list_00,038,936.txt" 1>"Word-list_00,038,936_COLLISIONS.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>hash "Word-list_00,105,982_English_Spell-Check_High-Quality.wrd" 1>"Word-list_00,105,982.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+75 "Word-list_00,105,982.txt" 1>"Word-list_00,105,982_SPEED.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+85 "Word-list_00,105,982.txt" 1>"Word-list_00,105,982_COLLISIONS.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>hash "Word-list_00,351,114_English_Spell-Check_Unknown-Quality.wrd" 1>"Word-list_00,351,114.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+75 "Word-list_00,351,114.txt" 1>"Word-list_00,351,114_SPEED.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+85 "Word-list_00,351,114.txt" 1>"Word-list_00,351,114_COLLISIONS.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>hash "Word-list_12,561,874_wikipedia-en-html.tar.wrd" 1>"Word-list_12,561,874.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+75 "Word-list_12,561,874.txt" 1>"Word-list_12,561,874_SPEED.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+85 "Word-list_12,561,874.txt" 1>"Word-list_12,561,874_COLLISIONS.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>hash "Word-list_22,202,980_wikipedia-de-en-es-fr-it-nl-pt-ro-html.tar.wrd" 1>"Word-list_22,202,980.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+75 "Word-list_22,202,980.txt" 1>"Word-list_22,202,980_SPEED.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>sort /+85 "Word-list_22,202,980.txt" 1>"Word-list_22,202,980_COLLISIONS.txt"
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>echo Done.
Done.
D:\_KAZE_new-stuff\_KAZE_hash_test_r1>dir
11/11/2010 07:52 AM 195,935 hash.cod
11/11/2010 07:52 AM 86,528 hash.exe
11/11/2010 07:31 AM <DIR> Peter_source
11/11/2010 07:52 AM 1,748 Runme.bat
11/11/2010 08:30 AM 2,937 Sentence-list_00,032,359.txt
11/11/2010 08:30 AM 2,937 Sentence-list_00,032,359_COLLISIONS.txt
11/11/2010 07:52 AM 4,347,243 Sentence-list_00,032,359_English_The_Holy_Bible.txt
11/11/2010 08:30 AM 2,937 Sentence-list_00,032,359_SPEED.txt
11/11/2010 08:30 AM 2,938 Word-list_00,038,936.txt
11/11/2010 08:30 AM 2,938 Word-list_00,038,936_COLLISIONS.txt
11/11/2010 07:52 AM 388,308 Word-list_00,038,936_English_The Oxford Thesaurus, An A-Z Dictionary of Synonyms.wrd
11/11/2010 08:30 AM 2,938 Word-list_00,038,936_SPEED.txt
11/11/2010 08:30 AM 2,939 Word-list_00,105,982.txt
11/11/2010 08:30 AM 2,939 Word-list_00,105,982_COLLISIONS.txt
11/11/2010 07:52 AM 1,121,365 Word-list_00,105,982_English_Spell-Check_High-Quality.wrd
11/11/2010 08:30 AM 2,939 Word-list_00,105,982_SPEED.txt
11/11/2010 08:30 AM 2,940 Word-list_00,351,114.txt
11/11/2010 08:30 AM 2,940 Word-list_00,351,114_COLLISIONS.txt
11/11/2010 07:52 AM 4,024,146 Word-list_00,351,114_English_Spell-Check_Unknown-Quality.wrd
11/11/2010 08:30 AM 2,940 Word-list_00,351,114_SPEED.txt
11/11/2010 08:30 AM 2,940 Word-list_00,584,879.txt
11/11/2010 08:30 AM 2,940 Word-list_00,584,879_COLLISIONS.txt
11/11/2010 07:52 AM 7,000,453 Word-list_00,584,879_Russian_Spell-Check_Unknown-Quality.slv
11/11/2010 08:30 AM 2,940 Word-list_00,584,879_SPEED.txt
11/11/2010 08:47 AM 2,943 Word-list_12,561,874.txt
11/11/2010 08:47 AM 2,943 Word-list_12,561,874_COLLISIONS.txt
11/11/2010 08:47 AM 2,943 Word-list_12,561,874_SPEED.txt
11/11/2010 07:52 AM 146,973,879 Word-list_12,561,874_wikipedia-en-html.tar.wrd
11/11/2010 09:19 AM 2,943 Word-list_22,202,980.txt
11/11/2010 09:19 AM 2,943 Word-list_22,202,980_COLLISIONS.txt
11/11/2010 09:19 AM 2,943 Word-list_22,202,980_SPEED.txt
11/11/2010 07:52 AM 278,013,406 Word-list_22,202,980_wikipedia-de-en-es-fr-it-nl-pt-ro-html.tar.wrd
For the Intel T3400 Merom 2.16GHz, the contents of 'Sentence-list_00,032,359_SPEED.txt', which is in fact the 32,359 sentences of The Holy Bible:
65536 elements in the table (16 bits)
32359 lines read
Fletcher: 28567 28709 28329 28521 28596| 28329 [ 7209]
FNV1A_Nefertiti: 30434 29919 30281 30112 30354| 29919 [ 6878]
FNV1A_Peregrine: 30652 30654 30774 30711 30728| 30652 [ 6838]
FNV1A_Whiz: 31917 31798 31608 31624 31934| 31608 [ 6874]
Murmur2: 32442 31898 31622 31843 31838| 31622 [ 6786]
Sixtinsensitive+: 34803 35073 35052 35265 35171| 34803 [ 6839]
SBox: 35864 36651 35840 35964 36036| 35840 [ 6839]
Novak unrolled: 37072 36698 36464 36601 37234| 36464 [ 6826]
Sixtinsensitive: 38126 38161 38783 38085 38243| 38085 [ 6876]
Paul Hsieh: 41114 40952 40467 40903 41335| 40467 [ 6874]
FNV1A_Smaragd: 40618 40934 40921 40550 40621| 40550 [ 6849]
lookup3: 42606 42204 42536 42584 42619| 42204 [ 6805]
Alfalfa_QWORD: 48411 47007 47281 47491 47252| 47007 [ 6943]
CRC-32: 47458 48495 47733 47692 48641| 47458 [ 6891]
Hanson: 49364 50769 49310 49310 49223| 49223 [ 19602]
Alfalfa: 53467 52707 53514 52711 52163| 52163 [ 6943]
x65599: 52965 53324 52546 52944 52932| 52546 [ 6859]
Paul Larson: 53238 52846 52558 53073 52954| 52558 [ 6889]
x17 unrolled: 53407 53882 53568 53693 53760| 53407 [ 6827]
Alfalfa_HALF: 53931 54075 54072 53659 53691| 53659 [ 6821]
Alfalfa_DWORD: 56396 56912 57265 57019 56184| 56184 [ 6943]
FNV-1a: 61401 61344 61129 60464 61180| 60464 [ 6840]
Bernstein: 62958 62738 63534 62466 62337| 62337 [ 6858]
K&R: 63326 63728 62776 62787 62992| 62776 [ 6785]
Sedgewick: 64195 63916 64174 63471 64562| 63471 [ 6858]
MaPrime2c: 63779 64134 63573 63849 64284| 63573 [ 6950]
Ramakrishna: 68307 69456 68476 68773 67562| 67562 [ 6943]
Arash Partow: 75811 75134 74359 74849 75068| 74359 [ 6845]
One At Time: 75324 76130 75891 75592 76803| 75324 [ 6937]
Weinberger: 102085 103031 102324 101752 101774| 101752 [ 6871]
For the Intel T3400 Merom 2.16GHz, the contents of 'Word-list_22,202,980_COLLISIONS.txt', which is in fact the words of wikipedia-de-en-es-fr-it-nl-pt-ro-html.tar:
67108864 elements in the table (26 bits)
22202980 lines read
Alfalfa_HALF: 21589312 21582860 21557249 21529387 21529700| 21529387 [ 3286890]
Alfalfa: 21797854 21768913 21771533 21790962 21787538| 21768913 [ 3288684]
Alfalfa_QWORD: 22137983 22122623 22102676 22106176 22115088| 22102676 [ 3288684]
Alfalfa_DWORD: 21922825 21938602 21896015 21906405 21938735| 21896015 [ 3288684]
Bernstein: 22069765 22066886 22078223 22068073 22070701| 22066886 [ 3290766]
K&R: 21767970 21769291 21786733 21808794 21781952| 21767970 [ 3290941]
Paul Larson: 22217132 22208663 22221180 22257973 22232433| 22208663 [ 3296692]
FNV-1a: 24783327 24754595 24761221 24768850 24773851| 24754595 [ 3297552]
Murmur2: 24429581 24436845 24439565 24473986 24420091| 24420091 [ 3297709]
SBox: 24497356 24495252 24474408 24482514 24486617| 24474408 [ 3298021]
FNV1A_Smaragd: 24892728 24657954 24713885 24642826 24653872| 24642826 [ 3298433]
CRC-32: 24315125 24364015 24358932 24370230 24357086| 24315125 [ 3298998]
lookup3: 25152714 25163446 25180860 25174309 25171522| 25152714 [ 3299369]
MaPrime2c: 25468907 25497087 25515137 25496127 25511109| 25468907 [ 3299747]
Sedgewick: 23187075 23188783 23201984 23182620 23187676| 23182620 [ 3302263]
One At Time: 25752155 25745719 25763766 25769312 25760894| 25745719 [ 3304908]
Ramakrishna: 22571194 22588985 22583130 22584062 22567869| 22567869 [ 3321824]
x65599: 22869457 22871793 22868825 22899146 22893458| 22868825 [ 3325064]
Arash Partow: 23330844 23304012 23325130 23319561 23321201| 23304012 [ 3325683]
FNV1A_Peregrine: 24709308 24696956 24739156 24542282 24027705| 24027705 [ 3333193]
FNV1A_Whiz: 23260698 23284708 23261485 23231186 23274844| 23231186 [ 3369088]
Sixtinsensitive: 23749686 23754606 23752685 23771534 23756371| 23749686 [ 3373923]
Hanson: 24678894 24686360 24668732 24692393 24679870| 24668732 [ 3408497]
Paul Hsieh: 25165181 25195490 25185265 25191703 25191274| 25165181 [ 3498543]
FNV1A_Nefertiti: 23178958 23204612 23158585 23201471 23168530| 23158585 [ 3505371]
Sixtinsensitive+: 23695832 23697515 23670644 23686530 23677100| 23670644 [ 3507772]
x17 unrolled: 21089676 21072607 21073703 21093963 21080539| 21072607 [ 3830652]
Weinberger: 27205172 27183550 27221767 27207192 27186851| 27183550 [ 5732660]
Novak unrolled: 76893278 76881102 76828362 76905661 76913653| 76828362 [10591108]
Fletcher: 61012927 60973684 60990148 60993752 60974823| 60973684 [14915258]
For the Intel Atom N450 Pineview-N 1.66GHz, the contents of 'Word-list_00,351,114_SPEED.txt', which is in fact the spell-checker's words:
1048576 elements in the table (20 bits)
351114 lines read
FNV1A_Nefertiti: 306194 306573 306192 308890 306516| 306192 [ 52963]
FNV1A_Whiz: 309750 309663 315782 309936 309856| 309663 [ 52966]
Sixtinsensitive+: 311474 311330 311115 314129 313496| 311115 [ 53040]
Alfalfa_HALF: 312465 312981 312385 316337 312773| 312385 [ 52454]
K&R: 313431 321837 312990 319829 313343| 312990 [ 52642]
FNV1A_Peregrine: 319362 329116 322006 319367 319748| 319362 [ 52551]
Bernstein: 321816 321772 321580 324893 322049| 321580 [ 52770]
x17 unrolled: 322271 321899 321849 324666 322026| 321849 [ 53556]
Paul Hsieh: 326436 322757 322904 322812 323075| 322757 [ 52729]
Sixtinsensitive: 325249 325427 325026 331674 325273| 325026 [ 53081]
Murmur2: 326730 326620 326377 332203 326445| 326377 [ 52738]
FNV1A_Smaragd: 329728 326406 329167 326707 326948| 326406 [ 52774]
lookup3: 334076 327371 327638 328289 328143| 327371 [ 52868]
Alfalfa: 329238 329437 329020 331666 330027| 329020 [ 52594]
Arash Partow: 333027 330317 330431 338750 330578| 330317 [ 52887]
Ramakrishna: 337692 331050 330770 330642 331796| 330642 [ 52764]
CRC-32: 335909 332821 332915 333260 332676| 332676 [ 52931]
Paul Larson: 335006 341362 335552 335411 335263| 335006 [ 52970]
Alfalfa_DWORD: 335984 336655 335764 341768 336394| 335764 [ 52594]
x65599: 337815 338067 337403 339999 337224| 337224 [ 52988]
Novak unrolled: 338616 338544 340872 339003 338570| 338544 [ 70274]
Alfalfa_QWORD: 342953 343383 343121 345655 346083| 342953 [ 52594]
Hanson: 345076 345541 344846 347805 345012| 344846 [ 57741]
Sedgewick: 346944 346191 348871 346527 346158| 346158 [ 52920]
FNV-1a: 352768 352508 357287 352658 352651| 352508 [ 52829]
SBox: 353603 353294 359148 353330 353175| 353175 [ 52688]
One At Time: 370592 369297 368441 368550 368128| 368128 [ 52836]
MaPrime2c: 398537 400557 398662 398581 398967| 398537 [ 52435]
Weinberger: 415567 418102 416656 415559 416279| 415559 [ 103386]
Fletcher: 440953 440887 440794 443035 441821| 440794 [ 182747]
For the Intel Atom N450 Pineview-N 1.66GHz, the contents of 'Word-list_22,202,980_SPEED.txt', which is in fact the words of wikipedia-de-en-es-fr-it-nl-pt-ro-html.tar:
67108864 elements in the table (26 bits)
22202980 lines read
K&R: 27212349 27196471 27200039 27186453 27196511| 27186453 [ 3290941]
Alfalfa_HALF: 27356819 27307195 27297075 27307918 27303211| 27297075 [ 3286890]
x17 unrolled: 27339582 27350110 27357010 27339325 27356507| 27339325 [ 3830652]
Bernstein: 27868098 27885616 27865719 27879787 27863864| 27863864 [ 3290766]
Ramakrishna: 28665899 28666251 28682862 28664667 28671544| 28664667 [ 3321824]
Alfalfa: 28888653 28886944 28877923 28887710 28891342| 28877923 [ 3288684]
FNV1A_Nefertiti: 29213855 29223509 29198537 29215353 29230921| 29198537 [ 3505371]
Alfalfa_DWORD: 29369155 29377827 29371609 29385218 29372532| 29369155 [ 3288684]
Paul Larson: 29416290 29420142 29438150 29438707 29445835| 29416290 [ 3296692]
Sixtinsensitive+: 29620313 29611448 29599723 29620233 29601234| 29599723 [ 3507772]
Arash Partow: 29601009 29607096 29607259 29604831 29600309| 29600309 [ 3325683]
Alfalfa_QWORD: 29766556 29768872 29761910 29768865 29750096| 29750096 [ 3288684]
Sixtinsensitive: 30910025 30227158 30218696 30253983 30229414| 30218696 [ 3373923]
FNV1A_Whiz: 30529924 30573778 30522564 30513371 30532538| 30513371 [ 3369088]
x65599: 30639384 30626734 30637721 30629312 30667952| 30626734 [ 3325064]
FNV1A_Peregrine: 31062527 31065744 31063962 31074164 31060828| 31060828 [ 3333193]
Sedgewick: 31231177 31217657 31227213 31231472 31210256| 31210256 [ 3302263]
Paul Hsieh: 31504060 31514350 31516370 31506898 31500243| 31500243 [ 3498543]
lookup3: 31603211 31576806 31546659 31551113 31562355| 31546659 [ 3299369]
FNV1A_Smaragd: 31788377 31580190 31580953 31575398 31584602| 31575398 [ 3298433]
Murmur2: 31588844 31591137 31578842 31585076 31581288| 31578842 [ 3297709]
CRC-32: 31993878 31986891 31976616 31993457 31978443| 31976616 [ 3298998]
Hanson: 32697671 32708803 32703678 32697383 32699428| 32697383 [ 3408497]
FNV-1a: 33097899 33081494 33091676 33097100 33087453| 33081494 [ 3297552]
Weinberger: 33148065 33152666 33133844 33143483 33153828| 33133844 [ 5732660]
SBox: 33418326 33433409 33412257 33432103 33433903| 33412257 [ 3298021]
One At Time: 34447974 34462104 34469829 34453808 34453369| 34447974 [ 3304908]
MaPrime2c: 36544157 36541591 36553914 36532887 36536317| 36532887 [ 3299747]
Fletcher: 85211946 85207760 85213725 85234580 85240698| 85207760 [14915258]
Novak unrolled: 122659701 122684073 122688489 122710846 122677634| 122659701 [10591108]
For the Intel Q9550 2.83GHz Yorkfield, the contents of 'Word-list_00,351,114_SPEED.txt', which is in fact the spell-checker's words:
1048576 elements in the table (20 bits)
351114 lines read
FNV1A_Whiz: 126979 126965 126909 127074 126892| 126892 [ 52966]
FNV1A_Nefertiti: 127558 127674 127561 127462 127655| 127462 [ 52963]
Novak unrolled: 133056 131005 130946 130748 130608| 130608 [ 70274]
FNV1A_Peregrine: 131261 131375 132194 131400 131466| 131261 [ 52551]
FNV1A_Smaragd: 135540 132354 132447 132105 134366| 132105 [ 52774]
Sixtinsensitive+: 140261 133781 132937 132972 133011| 132937 [ 53040]
Alfalfa: 133692 134306 134139 133886 133746| 133692 [ 52594]
x17 unrolled: 138124 135728 135652 138535 136985| 135652 [ 53556]
Alfalfa_HALF: 136099 135873 135958 136045 135852| 135852 [ 52454]
Alfalfa_DWORD: 139382 136591 137320 136955 136703| 136591 [ 52594]
SBox: 136660 136629 136652 139976 137095| 136629 [ 52688]
CRC-32: 136776 136682 137740 136886 137013| 136682 [ 52931]
Sixtinsensitive: 137325 139569 137416 138149 137460| 137325 [ 53081]
Murmur2: 137708 137451 137466 137370 137645| 137370 [ 52738]
Paul Larson: 137963 137860 137821 137806 137844| 137806 [ 52970]
Alfalfa_QWORD: 138891 140860 138630 138520 139282| 138520 [ 52594]
Hanson: 138905 138985 141471 140417 139221| 138905 [ 57741]
x65599: 139179 141376 138957 139922 139037| 138957 [ 52988]
K&R: 142387 142494 142280 142263 142276| 142263 [ 52642]
Bernstein: 143426 143500 143558 144583 143804| 143426 [ 52770]
Paul Hsieh: 144100 144173 145563 144810 144435| 144100 [ 52729]
Sedgewick: 145909 145885 145971 145638 146653| 145638 [ 52920]
lookup3: 147183 147415 148043 149721 147718| 147183 [ 52868]
FNV-1a: 147512 147545 147397 148011 149087| 147397 [ 52829]
Ramakrishna: 148659 156452 148559 148661 148236| 148236 [ 52764]
Fletcher: 149293 149231 149835 151863 149492| 149231 [ 182747]
Arash Partow: 149436 149511 149354 149420 149339| 149339 [ 52887]
MaPrime2c: 152965 153024 152778 153176 152732| 152732 [ 52435]
One At Time: 159041 158835 164564 159598 158996| 158835 [ 52836]
Weinberger: 197287 197401 197641 199104 199606| 197287 [ 103386]
For the Intel Q9550 2.83GHz Yorkfield, the contents of 'Word-list_22,202,980_SPEED.txt', which is in fact the words of wikipedia-de-en-es-fr-it-nl-pt-ro-html.tar:
67108864 elements in the table (26 bits)
22202980 lines read
x17 unrolled: 16711968 16714788 16712787 16715049 16715557| 16711968 [ 3830652]
Alfalfa_HALF: 16925846 16921816 16935959 16918072 16920664| 16918072 [ 3286890]
Alfalfa: 16949669 16948259 16945026 16947268 16949278| 16945026 [ 3288684]
Alfalfa_DWORD: 17087653 17089499 17090082 17092342 17092248| 17087653 [ 3288684]
K&R: 17217212 17215582 17214916 17211997 17210891| 17210891 [ 3290941]
Alfalfa_QWORD: 17284333 17281698 17296440 17284594 17287022| 17281698 [ 3288684]
Paul Larson: 17287965 17288993 17286715 17288186 17290720| 17286715 [ 3296692]
Bernstein: 17441366 17441744 17446211 17446642 17448115| 17441366 [ 3290766]
x65599: 17516305 17512376 17512059 17510499 17514638| 17510499 [ 3325064]
FNV1A_Whiz: 17578315 17575074 17574692 17572083 17571654| 17571654 [ 3369088]
FNV1A_Nefertiti: 17812040 17810394 17815863 17800634 17802965| 17800634 [ 3505371]
Sedgewick: 17813196 17805720 17809107 17808987 17809341| 17805720 [ 3302263]
Ramakrishna: 17821031 17818277 17816051 17817365 17820101| 17816051 [ 3321824]
FNV1A_Smaragd: 18156291 18032451 18056304 18014531 18010227| 18010227 [ 3298433]
FNV1A_Peregrine: 18037980 18041105 18047264 18049949 18050438| 18037980 [ 3333193]
Arash Partow: 18159192 18218352 18156661 18148943 18148120| 18148120 [ 3325683]
Sixtinsensitive+: 18273937 18273231 18272381 18276826 18272754| 18272381 [ 3507772]
CRC-32: 18299132 18303268 18300555 18307328 18296881| 18296881 [ 3298998]
Murmur2: 18312461 18309171 18315023 18312814 18306135| 18306135 [ 3297709]
Sixtinsensitive: 18326606 18325448 18319761 18326255 18320762| 18319761 [ 3373923]
SBox: 18454359 18453094 18457853 18448201 18451512| 18448201 [ 3298021]
Hanson: 18592317 18585036 18585941 18597290 18595878| 18585036 [ 3408497]
Paul Hsieh: 18931940 18935308 18939618 18942478 18940287| 18931940 [ 3498543]
lookup3: 18978329 18970650 18969792 18967985 18983381| 18967985 [ 3299369]
FNV-1a: 19055865 19051343 19051202 19054812 19049558| 19049558 [ 3297552]
MaPrime2c: 19497746 19470760 19474411 19474490 19476430| 19470760 [ 3299747]
One At Time: 19826720 19775854 19767551 19765074 19763605| 19763605 [ 3304908]
Weinberger: 21866786 21864914 21867700 21862238 21866417| 21862238 [ 5732660]
Fletcher: 44235533 44193774 44142011 44138717 44158624| 44138717 [14915258]
Novak unrolled: 52067982 52067088 52068005 52064574 52068712| 52064574 [10591108]
Any suggestions (adding more files/functions/...) are appreciated.
Something strange (beyond my understanding) is going on: on the Wikipedia words a significant slowdown appeared from nowhere!? Any ideas about this mystery?
Just for longer keys here comes the Jester - an unrolled Whiz:
UINT FNV1A_Hash_Jester(const char *str, SIZE_T wrdlen)
{
const UINT PRIME = 709607;
UINT hash32 = 2166136261;
const char *p = str;
// Idea comes from Igor Pavlov's 7zCRC, thanks.
/*
for(; wrdlen && ((unsigned)(ptrdiff_t)p&3); wrdlen -= 1, p++) {
hash32 = (hash32 ^ *p) * PRIME;
}
*/
for(; wrdlen >= 2*sizeof(DWORD); wrdlen -= 2*sizeof(DWORD), p += 2*sizeof(DWORD)) {
hash32 = (hash32 ^ *(DWORD *)p) * PRIME;
hash32 = (hash32 ^ *(DWORD *)(p+4)) * PRIME;
}
// Cases: 0,1,2,3,4,5,6,7
if (wrdlen & sizeof(DWORD)) {
hash32 = (hash32 ^ *(DWORD*)p) * PRIME;
p += sizeof(DWORD);
}
if (wrdlen & sizeof(WORD)) {
hash32 = (hash32 ^ *(WORD*)p) * PRIME;
p += sizeof(WORD);
}
if (wrdlen & 1)
hash32 = (hash32 ^ *p) * PRIME;
return hash32 ^ (hash32 >> 16);
}
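The listing uses the Windows typedefs UINT, DWORD, WORD, and SIZE_T; to try it outside MSVC, a small compatibility prologue and a test driver are enough (a sketch; the unaligned casts are kept as in the original, so the page-boundary caveat discussed above still applies):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t UINT;
typedef uint32_t DWORD;
typedef uint16_t WORD;
typedef size_t   SIZE_T;

/* paste FNV1A_Hash_Jester from above between the typedefs and main */
UINT FNV1A_Hash_Jester(const char *str, SIZE_T wrdlen);

int main(void)
{
    const char *key = "pneumonoultramicroscopicsilicovolcanoconiosis";
    printf("%08x\n", FNV1A_Hash_Jester(key, strlen(key)));
    return 0;
}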
262144 elements in the table (18 bits)
105982 lines read
FNV1A_Jester: 52198 52248 51275 52071 50515| 50515 [ 18774]
FNV1A_Whiz: 51168 52620 51802 52121 50774| 50774 [ 18774]
1048576 elements in the table (20 bits)
351114 lines read
FNV1A_Jester: 237708 235011 234586 234747 235305| 234586 [ 52966]
FNV1A_Whiz: 239372 239256 238010 238354 239165| 238010 [ 52966]
http://encode.ru/threads/1160-Fastest-non-secure-hash-function!
Hi Peter,
Having read some criticism, I added a test on enwik8, some 900,000 lines with all kinds of keys in it:
You are welcome to home of FNV1A_Jesteress:
http://www.sanmayce.com/Fastest_Hash/index.html
Regards
Georgi's page contains an interesting link to:
http://cbloomrants.blogspot.com/2010/11/11-19-10-hashes-and-cache-tables.html
It's written by the guy behind http://www.cbloom.com and if I understand correctly he is also interested in using hashes for compression algorithms, not only in compiler-needed hash tables.
Georgi, I've added Jesteress to the test; thank you very much for your efforts. Jesteress is even faster than Whiz! :) The number of collisions in the Numbers test remains high, though.
Ace, thanks for the link. He benchmarked STLport, a complex implementation that grows the hash table and uses jump tables and prime numbers for the table size. Something similar to what you wanted to do.
I finally got stable results on Core i5 by increasing the number of runs. There are some potentially useful ideas for statistical tests (chi-square, etc.) on the Murmur page.
Peter, I do not care about numbers, IPs and other numeric datasets; anyway, just to make all the fans happy, here comes Meiyan (Beauty, Charming Eyes, or most precisely: SOULFUL EYES):
FNV1A_Jesteress gives way to FNV1A_Meiyan because of better mixing - the last DWORD is split into two WORD passes to avoid losing the carries.
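A sketch of that tail difference (illustrative helper names, not Georgi's exact code): with one 32-bit round the two halves of the final DWORD never interact through carries, while two 16-bit rounds put a multiply between them:

#include <stdint.h>
#include <string.h>

#define PRIME 709607u

/* Jesteress-style tail: one 32-bit XOR-multiply round. */
static uint32_t tail4_jesteress(uint32_t h, const char *p)
{
    uint32_t d;
    memcpy(&d, p, 4);
    return (h ^ d) * PRIME;
}

/* Meiyan-style tail: two 16-bit rounds; the multiply after the first
   half spreads its carries before the second half is mixed in. */
static uint32_t tail4_meiyan(uint32_t h, const char *p)
{
    uint16_t a, b;
    memcpy(&a, p, 2);
    memcpy(&b, p + 2, 2);
    h = (h ^ a) * PRIME;
    return (h ^ b) * PRIME;
}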
Thus, just as in one of my favorite movies, where Meiyan transforms (only physically) from an ugly imp into a princess, here the ugly (because of numeric ugliness) Jesteress evolves into a sub-fantastic entity.
Intel Core i5 gives:
Intel Merom gives:
500 lines read
1024 elements in the table (10 bits)
1992 lines read
4096 elements in the table (12 bits)
500 lines read
1024 elements in the table (10 bits)
500 lines read
1024 elements in the table (10 bits)
500 lines read
1024 elements in the table (10 bits)
1842 lines read
4096 elements in the table (12 bits)
3228 lines read
8192 elements in the table (13 bits)
13408 lines read
32768 elements in the table (15 bits)
3925 lines read
8192 elements in the table (13 bits)
919074 lines read
2097152 elements in the table (21 bits)
10920423 lines read
33554432 elements in the table (25 bits)
12561874 lines read
33554432 elements in the table (25 bits)
22202980 lines read
67108864 elements in the table (26 bits)
Meiyan: "Hell that no fury like a beauty scorned. Haven't you heard?"
Meiyan: "No wonder they say all good people are tricky."
Ha-ha, right said Meiyan.
Meiyan: "I'm not garbage. Don't just throw me away. I'm not just a thing. I have a name! My name is Meiyan."
Peter,
I knew where the weakness (regarding collisions) was; it is no more, i.e. it is amended with the arrival of FNV1A_Mantis, the strongest of all my FNV1A variants.
Now FNV1A_Meiyan is between FNV1A_Jesteress and FNV1A_Mantis.
Personally, my favorite (because my way of using her differs a lot from that of other people) is still FNV1A_Jesteress, despite her apparent collision drawback in the 8-to-10-char interval, due to the mix being either one BYTE (8+1) or one WORD (8+2).
I tried (unseriously) to fix it in FNV1A_Meiyan, so here comes FNV1A_Mantis (00562-004a0 = 194 bytes, the fattest so far):
Ok, now some heavy hash hustle with fixed-length ASCII strings; in my opinion this is the most relevant and closest-to-practice benchmark (here: the fundamental match finding).
At the link below, this summer I approached LZ match finding in a dummy way, by counting the repetitions (in the English-rich text OSHO.TXT) through Building-Blocks_DUMPER:
http://encode.ru/threads/1134-Dummy-Static-Windowless-Dictionary-Text-Decompressor?p=22653&viewfull=1#post22653
Length of Building-Blocks / Quantity of ALL(with overlapping) Building-Blocks / Quantity of DISTINCT(with overlapping) Building-Blocks / Quantity of REPETITIVE(with overlapping) Building-Blocks
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
Building-Blocks_DUMPER rev.2, written by Kaze.
Note: This revision converts CR to $ and LF to # in order to have lines patternlen long ending with LF.
Sorting 206908947 Pointers to Building-Blocks 3 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB003.txt ...
Sorting 206908946 Pointers to Building-Blocks 4 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB004.txt ...
Sorting 206908945 Pointers to Building-Blocks 5 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB005.txt ...
Sorting 206908944 Pointers to Building-Blocks 6 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB006.txt ...
Sorting 206908943 Pointers to Building-Blocks 7 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB007.txt ...
Sorting 206908942 Pointers to Building-Blocks 8 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB008.txt ...
Sorting 206908941 Pointers to Building-Blocks 9 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB009.txt ...
Sorting 206908940 Pointers to Building-Blocks 10 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB010.txt ...
Sorting 206908939 Pointers to Building-Blocks 11 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB011.txt ...
Sorting 206908938 Pointers to Building-Blocks 12 chars in size ...
Allocated memory for pointers-to-words in MB: 790
Writing Sorted Building-Blocks to BB012.txt ...
Building-Blocks_DUMPER total time: 2658329 clocks
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
LONG2DOT.EXE, revision 001.
Written by Svalqyatchx 'Kaze'.
Example: liner ip.csv ip.txt
Output file: IPS.TXT
Lines: 2995394
LONG2DOT: Done.
8388608 elements in the table (23 bits)
2995394 lines read
8388608 elements in the table (23 bits)
2995394 lines read
46486 lines read
67108864 elements in the table (26 bits)
248019 lines read
67108864 elements in the table (26 bits)
855682 lines read
67108864 elements in the table (26 bits)
2236138 lines read
67108864 elements in the table (26 bits)
4803152 lines read
67108864 elements in the table (26 bits)
8956496 lines read
67108864 elements in the table (26 bits)
15006172 lines read
67108864 elements in the table (26 bits)
22992127 lines read
67108864 elements in the table (26 bits)
32707519 lines read
67108864 elements in the table (26 bits)
43802365 lines read
67108864 elements in the table (26 bits)
In order to enrich the versatility of your testbed, my suggestion is that the Heavy-IP dataset (IPs.TXT, 2,995,394 keys) be added (as a table) to the en-wikipedia dataset; I think millions of words & IPs are a must-show basis. I don't like (speaking of some critics) words such as 'untrustworthy' to be connected with my name.
As for the match-finding test, I guess the critics must offer first, shoot next.
Regards.
Oh, I forgot to give the testbed for all these dumps above:
http://www.sanmayce.com/Downloads/_KAZE_hash_test_r3.7z
thanks, but Mantis reads beyond the buffer boundary if (wrdlen < 15). For example, if wrdlen == 7:
Yes Peter, again a stupid error from my side, caused by the hurry-mode I am in these days; I saw it yesterday, 3 hours after updating my site, and fixed it by wrapping with 'if (wrdlen) {}'.
In 1 hour I will post again; last night I also remade the 200MB test, with some big table in mind for the weekend.
Mantis is a very good predator/hash, I will be glad if you test it on your machine.
Thanks.
The Mantis source and the Benchmark-Dumps (http://www.sanmayce.com/Downloads/_KAZE_hash_test_r3.7z) (224,916,121 bytes) are fixed now;
here is the bugless Mantis:
Mantis results on Pentium M:
Core i5:
Generally, Mantis has a similar number of collisions to Meiyan, but Mantis is slower.
Thanks for testing,
at the link below is my attempt to present the Heavy-Hash-Hustle dumps in a more digestible fashion:
http://www.sanmayce.com/Fastest_Hash/index.html#Heavy-Hash-Hustle
Intel Core 2 Quad Q9550S Yorkfield 2.83GHz 12MB L2 Cache:
8388608 elements in the table (23 bits)
2995394 lines read
Intel Atom N450 1.66GHz 512KB L2 Cache:
8388608 elements in the table (23 bits)
2995394 lines read
My wish is to cover all meaningful (at least for LZ) lengths, that is 3..66 bytes, but a different approach must be taken because of the HUGE size of the dataset:
Speaking of the very precious (regarding English language usage and original thoughts, with hundreds of books included) OSHO.TXT, I propose one simple way of achieving Building-Blocks hashing:
loading the 197MB file itself and hashing the 3..66-byte chunks at each position (i.e. one-byte increments).
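A sketch of that scheme (names are mine; plain FNV-1a stands in for the hasher under test): slide an L-byte window over the whole buffer and hash at every one-byte increment:

#include <stdint.h>
#include <stddef.h>

static uint32_t fnv1a(const uint8_t *p, size_t len)
{
    uint32_t h = 2166136261u;
    while (len--)
        h = (h ^ *p++) * 16777619u;
    return h;
}

/* Hash every L-byte chunk of an n-byte buffer, one-byte increments;
   count() stands for whatever the table does with each hash. */
static void hash_all_chunks(const uint8_t *buf, size_t n, size_t L,
                            void (*count)(uint32_t))
{
    for (size_t i = 0; i + L <= n; i++)
        count(fnv1a(buf + i, L));
}

Rehashing L bytes per step costs O(n*L) overall; a rolling hash would cut the per-step work to O(1), which matters at 197MB times 64 lengths.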
Another thing I want to share regarding collision managing:
Approaches (rehashing, chains, ...) without definite goals, i.e. without context, are like kata (detailed choreographed patterns of movements); the real fight is an extension/mix of kata/techniques with complex timing, which includes awareness of the timings of outer things, not just your own. I mean: if enough resources (free RAM) are given, not utilizing/exploiting them while talking about speed as this-and-that is a dead end.
For example, I tested (now commented out) a FNV1A variant hash function in Leprechaun which outperforms (hash time + lookup time) FNV1A_Jesteress by (1,8??,???-1,6??,???)/1,6??,???*100% = 12.5%, but that is entirely due to the B-tree used as the collision manager at the final stage.
This very hasher does not perform well when other techniques are used, though.
The point is: speed is something beyond all imposed limitations; it must be chased for each niche relentlessly.
One phenom in the real world is Mr. Bolt: his fantastic technique is being constantly improved, or, as he says in one interview, he and his trainer are working on making one of the fastest starts in 100M races even faster. It's just amazing that the tallest sprinter has one of the most explosive starts as well. And even more amazing is the will for improvement.
I have a sambo practitioner buddy who said about his 100/200M records: "What technique? It is just left-right left-right!"
Of course I disagreed. Neglecting the basic/fundamental stances leads to nasty future slips (a kind of 'Oh! What happened?', i.e. a lack of further deep understanding and technique improvement).
I fully agree with:
http://cbloomrants.blogspot.com/2010/11/11-29-10-useless-hash-test.html
Looking at my 3chars..12chars Building-Block test I see the strong candidate for A future ultimate testbed.
Hi Peter,
the day before yesterday a documentary on the History Channel about Nikola Tesla (an outstanding man, not only a pragmatic visionary) inspired me to tune an almost forgotten hasher.
Here comes FNV1A_Tesla: a suitable hasher for keys [much] longer than 15 bytes (the case of 3+grams phrases).
That is the very (with the 64bitx32bit->32bit multiplication and loss of carry) hash function I was talking about previously.
In all tests below, FNV1A_Tesla outspeeds all my FNV1A variants.
Surprisingly, the bad collision rate doesn't affect its speed; I have been hit [again] by the fact that the brutal loss of data doesn't hurt the lookup (when the keys don't have ONLY weak-range lengths).
Here the speed/dispersion trade-off was made in favor of speed of course.
The function itself:
I am curious what amendments can be done. It is revision 1.
My 64-bit knowledge/experience is next to nothing, so it would be nice if somebody refined it, especially for 64-bit compilers.
Some tests (on my Intel Merom 2.16GHz, Windows XP 32bit, VS2008 32bit compiler):
Volume in drive D is H320_Vol5
Volume Serial Number is 0CB3-C881
22202980 lines read
67108864 elements in the table (26 bits)
12561874 lines read
33554432 elements in the table (25 bits)
351114 lines read
1048576 elements in the table (20 bits)
38936 lines read
131072 elements in the table (17 bits)
584879 lines read
2097152 elements in the table (21 bits)
32359 lines read
65536 elements in the table (16 bits)
1 lines read
4 elements in the table (2 bits)
2995394 lines read
8388608 elements in the table (23 bits)
17981107 lines read
67108864 elements in the table (26 bits)
Regards
I am hunting for an extremely fast integer->integer hashing method for working with a large array of hashtables, in particular, where there are a high number of key (re)inserts, key (re)deletions, and value (re)updates within each hashtable, as large volumes of data are processed. Currently writing in C, but open to inlining assembly if it offers nice gains.
By the will of the hash(ing) gods ... show me the way!
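For integer keys, a short multiply-xorshift mixer is the usual answer; the 32-bit finalizer from MurmurHash3 is a well-known example (shown as a sketch, not as a recommendation from the article):

#include <stdint.h>

/* MurmurHash3 fmix32: full avalanche in five cheap register ops. */
static inline uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Index into a power-of-two table: mix32(key) & (size - 1). */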
Testing avalanche on integer hash functions
http://baagoe.org/en/wiki/Avalanche_on_integer_hash_functions
I'd love to use some hashes in PHP without having to enable or install an extension to do so. Naturally, speed and efficiency would suffer, but the "portability" of the code makes the trade-off worth it for my needs. I'd love to have lookup3/SuperFastHash ported to a PHP function; even One-At-A-Time would be great!
How about adding some CityHash hashes to the comparison?
Thanks Peter! I've found your blog extremely useful. Your discussion with @Sanmayce led me to your blog.
Keep up a good job!
Cuckoo hashing: neither linear probing nor chaining
http://en.wikipedia.org/wiki/Cuckoo_hashing
I found the following to be an interesting hash function
http://code.google.com/p/xxhash/
Additional info can be found here:
http://fastcompression.blogspot.ca/2012/04/selecting-checksum-algorithm.html
I concur with the general conclusion that Adler-32 should not be used as a hash function. It "fills up" with information from the input very slowly, which is definitely not what you want in a hash function, especially when used on short strings.
However the observation above: "The second problem is that the characters are not "weighted" (multiplied by different numbers), so that Adler-32("01") = Adler-32("10"), that's why it fails the Numbers test. Ditto for anagrams in Shakespeare's sonnets: Adler-32("heart") = Adler-32("earth")." is not correct. The Adler-32 of "01" is 0x00930062, whereas the Adler-32 of "10" is 0x00940062. Similarly, the Adler-32 of "heart" is 0x061c0215, and the Adler-32 of "earth" is 0x06280215.
The leading zeros in each half-word of those last two is indicative of the slow filling that I referred to.
I'm sorry, I've corrected the mistake in the article and added a link to your comment. The checksum actually has no collisions for the numbers, but when the values are reduced modulo the hash table size, they collide. For example, when the hash table size is 2^16:
It helps if you XOR the lower and the higher part, but there are still a lot of collisions:
As you said, the function should be used as a checksum, not as a hash function.
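To make the modulo collision concrete, a minimal demo using zlib's adler32 (the full values come from the comment above):

#include <stdio.h>
#include <string.h>
#include <zlib.h>   /* link with -lz */

int main(void)
{
    const char *w[2] = { "heart", "earth" };
    for (int i = 0; i < 2; i++) {
        unsigned long h = adler32(adler32(0L, Z_NULL, 0),
                                  (const unsigned char *)w[i],
                                  (unsigned)strlen(w[i]));
        /* full hashes differ (061c0215 vs 06280215), but the low 16 bits
           agree, so both land in the same slot of a 2^16-entry table;
           XOR-folding the halves separates them again */
        printf("%-5s  %08lx  mod 2^16: %04lx  folded: %04lx\n",
               w[i], h, h & 0xFFFFUL, (h ^ (h >> 16)) & 0xFFFFUL);
    }
    return 0;
}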
Hi Peter,
just wanted to see how the fastest (regarding linear speed) hasher, FNV1A_Tesla (64-bit), rewritten down to 32-bit, would behave; so here comes my new favorite FNV1A_Yorikke - the fastest 32-bit hasheress.
She outspeeds both FNV1A_Jesteress and FNV1A_Meiyan while featuring collisions comparable to CRC32.
I wonder how the new approach (hashing two lines) behaves on i5/i7; my expectation is that FNV1A_Yorikke is gonna scream.
I hate the fact that I still cannot play with an i7 machine, so the following results were obtained on my laptop T7500 2200MHz:
500 lines read
1024 elements in the table (10 bits)
1992 lines read
4096 elements in the table (12 bits)
500 lines read
1024 elements in the table (10 bits)
500 lines read
1024 elements in the table (10 bits)
500 lines read
1024 elements in the table (10 bits)
1842 lines read
4096 elements in the table (12 bits)
3228 lines read
8192 elements in the table (13 bits)
13408 lines read
32768 elements in the table (15 bits)
3925 lines read
8192 elements in the table (13 bits)
3333 lines read
8192 elements in the table (13 bits)
2995394 lines read
8388608 elements in the table (23 bits)
584879 lines read
2097152 elements in the table (21 bits)
Wikipedia en
12561874 lines read
33554432 elements in the table (25 bits)
22202980 lines read
67108864 elements in the table (26 bits)
1 lines read
4 elements in the table (2 bits)
5000000 lines read
16777216 elements in the table (24 bits)
In the last test we hash 5 million 128-byte lines (a kind of super-heavy-prefix test).
It resembles having 5 million very similar (sharing one long prefix) 128-char tweets.
Obviously we need more versatile tests (like my 'Knight Tours') in order to find and fix possible weak points.
Here x17 unrolled, FNV-1a, Larson, and SBox fail to keep up with the others.
In the second-to-last test (linear hash speed, 100MB as one line), XXHfast32 outspeeds FNV1A_Yorikke by (176506-168528)/168528*100% = 4.7%; is this the case on i5/i7 machines!?
In my view FNV1A_Yorikke has a heart full of soul, or is "fine and clean like polished gold" (just as B. Traven describes her); yet more torture is needed.
You all are welcome to download my (your benchmarker is used) latest (all above tests included) hash package:
If we had to hash some 5,000,000,000+ tweets, things would rapidly get ugly.
It would be useful to add more fault-finding tests, so that we can rely on some hasher for "real world" (i.e. heavy) loads.
Regards
Very informative analysis. Sincerely thank you Peter. Made my job of finding a good hash function so much easier with the research you have conducted in this topic.
Hi Peter,
Thanks for the article. It would be great if you made your test for Linux too; you would be surprised by the results.
For me the best hash function here is “Yorikke”.
Thank You Sanmayce!
Thank you Mr. Norton,
I wrote Knight-tour_r8dump_Yorikke.c (downloadable at link below) which hashes Knight-Tours on the fly in order to monitor the fattest slot(s).
This torture-test uses a 27-bit hash table, i.e. 134,217,728 slots.
http://www.sanmayce.com/Fastest_Hash/index.html#KT_torture
Up to last night the test reached 550,000,000 Knight-Tours:
FNV1A_Yorikke has 005 slots with MAX_depthness 019 (number of x's).
CRC32 has 001 slot with MAX_depthness 021 (number of x's).
In other words, MAX_depthness 019 means a maximum of 019 layers, i.e. at most 019 keys sharing one slot.
I will wait until all the 5,000,000,000 Knight-Tours are hashed, maybe a week more.
This is the heaviest torture-hash-test, no?
Also I am very glad of Mr. Noll's retweet:
Dear Sanmayce,
I would like to thank you for the intention to share your amazing real-McCoy hash functions!
When you finish the test, I hope people will see the “Yorikke” beauty as I do.
In our “hostile” 24/7 *nix environment speed counts, and “Yorikke” rocks!
It's a treasure and a state-of-the-art work.
Thanks again and all the best!
Some idea to speed up (but little more collisions)
by Alexander Myasnikov
amsoftware.narod.ru (amsoftware at ya.ru)
Unfortunately, the hash functions don't perform well on my benchmark. The results of "IPv4" and "Numbers" tests are especially bad:
Some collision and speed tests on different data sets (Russian)
http://amsoftware.narod.ru/algo2.html
xxHash has been updated recently :
http://code.google.com/p/xxhash/
Hi, I've created the following hash function for my own use, and found that using randomly generated strings I repeatedly get 0 collisions. I repeatedly tested with 10 million such random strings with the same 0 collisions, and also compared the same sets of strings with the djb function; the djb function averaged more than 200,000 collisions.
I lack testing sources/techniques and would be very interested in the results if you would be so kind as to test it.
Here's the function:
union
unsigned long t;
PS to my previous post:
I'm using a 64-bit machine and sizeof(long) == 8 bytes
and I think the collisions will start in earnest with
strings that exceed 63 bytes in length (then (i/d)>=8)
and the shift s[i] may result in 0 - just a guess.
(sorry - I feel like a spammer;)
Here is a revised/improved version of the function:
I tested it against MurmurHash2 with the same (my) data-sets and the results were about the same. I would really appreciate a comparative test - the function, for it's simplicity, seems to work remarkably well :)
It looks like SipHash is becoming the new standard for hashing short messages. I'm curious to see how it performs on your benchmarks.
https://131002.net/siphash/
A typical hash map in web apps is a JSON object. It has short attribute names, and not a lot of entries.
Let's assume the average identifier size is 12 chars, and the number of entries is, say, in the range 16-32 (certainly less than 256).
Clearly, we need a hashing function optimized for this case.
In this regard, it would be interesting to see statistics for the Pearson algo in your research.
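For reference, Pearson hashing is a byte-at-a-time walk through a 256-entry permutation table and returns a single byte, which matches the under-256-entries case; a minimal sketch (the table contents are a free parameter):

#include <stdint.h>
#include <stddef.h>

/* T must hold a permutation of 0..255; the result indexes
   a table of at most 256 buckets. */
static uint8_t pearson(const uint8_t *key, size_t len, const uint8_t T[256])
{
    uint8_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = T[h ^ key[i]];
    return h;
}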
NOTE:
For short identifiers, the amortized cost of the last iteration's branch misprediction is very high, approx. 1 cycle/byte, so it would probably make sense to have N specialized versions (for string lengths 1, 2, ... say, 24) and use an indirect jump (costs 5 cycles); this translates to 1 cycle per char in savings, comparable with the cost of the computation per char itself.
What do you think?
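A sketch of the proposed dispatch (hypothetical names; only two specializations shown, one table entry per supported length): the length picks the routine through a single indirect call, and each routine has a fixed, fully unrollable trip count:

#include <stdint.h>

typedef uint32_t (*hash_fn)(const char *);

static uint32_t hash_len4(const char *s)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; i++)               /* fixed trip count */
        h = (h ^ (uint8_t)s[i]) * 16777619u;
    return h;
}

static uint32_t hash_len8(const char *s)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 8; i++)
        h = (h ^ (uint8_t)s[i]) * 16777619u;
    return h;
}

static const hash_fn by_len[9] = {
    0, 0, 0, 0, hash_len4, 0, 0, 0, hash_len8
};

/* usage: uint32_t h = by_len[len](key);  -- len must be a supported size */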
Choosing a Good Hash Function, Part 3
http://blog.aggregateknowledge.com/2012/02/02/choosing-a-good-hash-function-part-3/
Hi guys,
I am happy to share my latest best.
A remainderless variant for 16[+] bytes keys appeared while I was playing with Yorikke and wanting to try an old idea of mine - to reduce the branching.
When the GOLDEN Yorikke was Interleaved&Interlaced a DIAMANT appeared - it's time for a new generation slasher: FNV1A_Yoshimura.
Up to now the TOP benchmark results I was able to gather:
On AMD (Phenom II X6 1600T, 4000MHz) FNV1A_YoshimitsuTRIAD reigns with 11,360 MB/s, or 11360/1024 = 11.093 GB/s.
It is worth exploring the Jesteress-Yorikke 8-16 GAP (i.e. 1 hash line of 4+4 vs 2 hash lines of 4+4) in order to lessen their collisions further.
Simply put: 3 hash lines of 4 bytes each, 12 bytes per loop.
The 'non power of 2' workaround I see as one MONOLITH function with no remainder mixing at all.
Idea #1 is to exterminate all the nasty IFs outside the main loop; I believe such a branchless etude will outperform Jesteress.
Idea #2 is to STRESS memory by fetching not-so-adjacent areas.
For example:
Key: hash_with_overlapping_aye_aye
Key_Length: 29
Loop #1 of 3:
Hash line 1: hash
Hash line 2: h_ov
Hash line 3: ping
Loop #2 of 3:
Hash line 1: _wit
Hash line 2: erla
Hash line 3: _aye
Loop #3 of 3:
Hash line 1: h_ov
Hash line 2: ppin
Hash line 3: _aye
I don't know the internals, whether lines are 32/64/128 bytes long is a secondary concern.
Well, the key is too short, in reality the above key may span only 1|2 cache lines, if the key is longer than 4 cache lines (assuming 32) e.g. 128+2 bytes then it may span 5|6 lines.
My dummy/clueless premise is that it is possible (in future systems) to access RAM effectively in such a manner.
Does someone know whether such type of accessing has any practical value in nowadays CPUs?
Of course, it is worth trying to "interleave" in that way all the short string hashers, yes?
Anyway the 3 lines are in stack, for the time being let's see how 'INTERLEAVED' Yorikke, which I called FNV1A_Yoshimura, behaves.
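A rough sketch of the hash-lines idea in plain C (sequential 12-byte strides rather than the overlapped fetches illustrated above; the lane merging is my own guess):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Three independent accumulators eat 3 x 4 bytes per iteration, giving
   the CPU three multiply chains it can overlap. */
static uint32_t hash_3lanes(const char *p, size_t len)
{
    const uint32_t PRIME = 709607u;
    uint32_t h1 = 2166136261u, h2 = 2166136261u, h3 = 2166136261u;
    uint32_t a, b, c;
    while (len >= 12) {
        memcpy(&a, p, 4);
        memcpy(&b, p + 4, 4);
        memcpy(&c, p + 8, 4);
        h1 = (h1 ^ a) * PRIME;   /* hash line 1 */
        h2 = (h2 ^ b) * PRIME;   /* hash line 2 */
        h3 = (h3 ^ c) * PRIME;   /* hash line 3 */
        p += 12; len -= 12;
    }
    for (; len; len--, p++)                  /* byte tail; the post    */
        h1 = (h1 ^ (uint8_t)*p) * PRIME;     /* proposes overlapped    */
                                             /* reads instead          */
    h1 = (h1 ^ h2) * PRIME;                  /* fold the lanes together */
    h1 = (h1 ^ h3) * PRIME;
    return h1 ^ (h1 >> 16);
}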
To reproduce the quick-test below here comes: http://www.sanmayce.com/Fastest_Hash/DOUBLOON_hash_micro-package_r2.zip
The results on my 'Bonboniera' T7500, throwing mostly 16+ long keys at the "awful greedy country samurai":
Intel 12.1:
3333 lines read
8192 elements in the table (13 bits)
1000 lines read
2048 elements in the table (11 bits)
Yoshimura: 235 [ 187] ??? the slowest of all the four Yo*, something to ponder about ???
32359 lines read
65536 elements in the table (16 bits)
Microsoft 16:
3333 lines read
8192 elements in the table (13 bits)
1000 lines read
2048 elements in the table (11 bits)
32359 lines read
65536 elements in the table (16 bits)
Mr. Norton, you are most welcome. Maybe you have already spotted that I don't target 64-bit stamps at all; yet, if you are interested, I can write r.3 of my 'Tesla' function using these nifty techniques (Interleaving & Interlacing). I believe it will outperform itself, being the fastest 64-bit hash in the m^2 testbench.
I thank Przemyslaw Skibinski and Maciej Adamczyk (m^2) for their 64bit testbench which I included along with yours (Peter) in the benchmark:
http://www.sanmayce.com/Fastest_Hash/DOUBLOON_hash_micro-package_r3.zip
Results below are for the Intel 12.1 32bit executable on my laptop:
1 lines read
4 elements in the table (2 bits)
1 lines read
4 elements in the table (2 bits)
1 lines read
4 elements in the table (2 bits)
I wrote revision 3 of FNV1A_Tesla as 64bit counterpart of FNV1A_Yoshimura and included them into the 64bit linear speed test by Przemyslaw and Maciej, the results are (I threw the 200MB at the hashers):
As console screenshots:
http://www.sanmayce.com/Fastest_Hash/64bit_page1_.png
http://www.sanmayce.com/Fastest_Hash/64bit_page2_.png
As console text dumps:
FNV1A_Yoshimura is simply DIAMANTINE.
VMAC-64, a cryptographically strong keyed hash, runs on x64 at 4 bytes/cycle, i.e. 16 GB/s.
Thank you. VMAC-64 is a cryptographic hash function, not a function for hash tables. Most likely, it will be slow on the short strings that are used in my benchmark. You are welcome to benchmark it and post the results here.
For one, Bulat's point is from a data-deduplication context, am I right?
There VMAC (VHASH) IS a function for hash tables, where the keys can be chunks several KB long.
Though not "natively" targeted at hash tables, it could be used as such, similarly to CRC32, which is not a "native" HT function but a checksum.
Looking at VMAC(VHASH)'s main loop (the version below does 64-bytes of message at a time):
It looks elegant and straightforward, with potential to be bettered; yet these 4 64x64 MULs are STILL scary.
As for those 4 bytes/cycle (on i7?!), they are kick-ass, but in that respect 'xxhash256' kicks its ass quickly:
http://encode.ru/threads/1371-Filesystem-benchmark?p=33515&viewfull=1#post33515
Well, xxhash256 is a monster, but there are supermonsters as well; in the name of reaching higher Bytes-Per-Cycle, let us see what a Core i3 laptop can achieve in L3/L2/L1 using XMM registers:
Core i3 2310M 2.1GHz laptop:
For reference my Core 2 T7500 2.2GHz laptop gave:
Seeing that 'Everest' gives 67011MB/s for an L1 cache read on this i3, and comparing that to the 19877MB/s achieved, results in 237% non-utilization.
A brutal improvement in the XMM department over the pre-previous generation (the i3 being 2nd generation); 8GB/s is a huge difference indeed.
FNV1A_penumbra simply combines the short (192-) and "long" (192+) byte keys under one roof: the first are hashed by FNV1a-YoshimitsuTRIADii, the second by an unrolled XMM FNV1a whopper.
And one more thing, dispersion: a FNV1A-YoshimitsuTRIADii vs VHASH showdown would be interesting.
Currently I am hashing (using 27bit HT) 1+ trillion a la Knight-Tour 128 byte long keys, 220h later the result is:
In my view each and every 'text message' HT function should be tortured with this 128 (neither long nor short) key length.
I ran a quick slash_hash test with SMHasher:
https://gist.github.com/anonymous/5926294
Not great.
SipHash, safe against DoS attacks, by Jean-Philippe Aumasson and Daniel J. Bernstein:
https://131002.net/siphash/
Another one worth mentioning, CityHash:
http://google-opensource.blogspot.co.at/2011/04/introducing-cityhash.html
http://code.google.com/p/cityhash/
And the set of tests maintained by Murmur authors: http://code.google.com/p/smhasher/
Thank you, SipHash looks interesting. I will include it in the benchmarks.
http://blog.booking.com/hardening-perls-hash-function.html
"Analysis done by the Perl5 Security Team suggests that One-At-A-Time-Hash is intrinsically more secure than MurmurHash. However to my knowledge, there is no peer-reviewed cryptanalysis to prove it.
There seems to be very little research into fast, robust, hash algorithms which are suitable for dynamic languages. Siphash is a notable exception and a step forward, but within the Perl community there seems to be consensus that it is currently too slow, at least in the recommended Siphash-2-4 incarnation. It is also problematic that its current implementation only supports 64 bit architectures. (No doubt this will improve over time, or perhaps even already has.)"
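For reference, the One-At-A-Time hash discussed in the quote has this well-known shape (Bob Jenkins' function; as I understand it, Perl's hardened variant adds a random seed and extra final mixing on top of this basic form):
#include <stddef.h>
#include <stdint.h>
uint32_t OneAtATime(const char *key, size_t len)
{
    uint32_t hash = 0;
    for (size_t i = 0; i < len; ++i) {  /* one byte per iteration */
        hash += (uint8_t)key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;                  /* final avalanche */
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}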
https://github.com/facebook/folly/blob/master/folly/container/F14.md
"F14 is a 14-way probing hash table that resolves collisions by double hashing. Up to 14 keys are stored in a chunk at a single hash table position. Vector instructions (SSE2 on x86_64, NEON on aarch64) are used to filter within a chunk; intra-chunk search takes only a handful of instructions. F14 refers to the fact that the algorithm Filters up to 14 keys at a time. This strategy allows the hash table to be operated at a high maximum load factor (12/14) while still keeping probe chains very short.
F14 provides compelling replacements for most of the hash tables we use in production at Facebook."
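The "filter within a chunk" step that the F14 description refers to can be sketched like this (an illustration of the technique, not Facebook's actual code; the function name is made up, and it assumes one 8-bit tag per stored key plus SSE2):
#include <emmintrin.h>                  /* SSE2 intrinsics */
#include <stdint.h>
/* Compare all 16 tag bytes of a chunk against the probe tag at once;
   the returned bitmask marks candidate slots, which must then be
   verified by full key comparison. Only 14 of the 16 slots hold keys. */
unsigned CandidateSlots(const uint8_t tags[16], uint8_t probe_tag)
{
    __m128i chunk = _mm_loadu_si128((const __m128i *)tags);
    __m128i match = _mm_cmpeq_epi8(chunk, _mm_set1_epi8((char)probe_tag));
    return (unsigned)_mm_movemask_epi8(match) & 0x3FFF;
}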
It seems t1ha is superior to all of the above functions, both in speed and in quality.
Of course, it could be used with Folly's F14 hash table.
https://github.com/PositiveTechnologies/t1ha
Thanks, I will test your function.
Hi Peter,
glad to share the latest-and-fastest FNV1A variant.
For a long time I have known how much more is out there - many coders have shared very nice etudes - but my 'Yorikke' has something special: the ... Zennish approach embedded :P
Currently I am writing an insane matchfinder using B-trees while hashing millions of keys of orders 4, 6, 8, 10, 12, 14, 16, 18, 36, 64; thus a hasher of superhigh speed (FOR SMALL KEYS) is needed, since the B-trees are constructed in multiple passes and billions of hash invocations of Yorikke are to be used. Latency is crucial; throughput is meh.
[Benchmark output: tables of 500, 13408, 3925, 500, 500, 500, 3228, and 1842 keys (10- to 15-bit tables); timing columns not preserved.]
The feed:
https://github.com/wangyi-fudan/wyhash/issues/29#issuecomment-538078396
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/824947#comment-1946227
Silly me, I had to fix v2; now everything is OK. My excuse: yesterday I was distracted the whole day.
So, here comes v3:
Peter, I see no way to better the 32-bit code hashers, so the fastest 32-bit hasher in 64-bit code known to me is:
https://forum.thefreedictionary.com/postsm1118964_MASAKARI--The-people-s-choice--General-Purpose-Grade--English-wordlist.aspx#1118964
What is the bucket count used for all the hash functions mentioned in the graph named "hash function quality using Red Dragon book" in this blog? Please answer this query.
Hello Mohit, thank you, it's a good question. The number of buckets (slots) was twice the number of items, rounded up to the next power of two. For example, in the "numbers" test there are 500 items in the table, so the number of buckets (hash table size) is 1024 (2 * 500 = 1000, rounded up to the next power of two). Hope this helps.
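In code, the sizing rule amounts to something like this (a minimal sketch of the rule described above, not the benchmark's actual source; the function name is made up):
#include <stddef.h>
size_t TableSize(size_t items)
{
    size_t size = 1;
    while (size < 2 * items)            /* next power of two >= 2 * items */
        size *= 2;
    return size;                        /* TableSize(500) == 1024 */
}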
Thanks for your response
I just wanted to know the exact values of distribution uniformity for the different hash functions in the graph named "hash function quality using Red Dragon book". Can you please provide the exact values from which you plotted the graph?
It would be very helpful of you.
Can you please provide me with the exact values in the hash function quality graph, just for the "numbers" dataset? It would be really helpful.
Thanks in advance @Peter Kankowski.
Where's the Wikipedia wordlist (or where does one get it from), and, more importantly, the OA test code?
I'm looking for a good OA spread/avalanche combo that's cheap enough but doesn't invoke UB or IB in C and is extremely portable, so it has to read by bytes - which makes a CRC surprisingly expensive: 2.39 to one-at-a-time's 2.21 on my test system (a borrowed Windows machine; I don't normally have one, but your source is for Windows).
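For context, the byte-at-a-time CRC being timed there is presumably the classic table-driven form, sketched below (standard reflected CRC-32 with polynomial 0xEDB88320, not the commenter's exact code; the names are made up). Reading one byte at a time keeps it portable, but costs a table lookup per byte:
#include <stddef.h>
#include <stdint.h>
static uint32_t crc_table[256];
static void InitCrcTable(void)          /* call once before hashing */
{
    for (uint32_t n = 0; n < 256; ++n) {
        uint32_t c = n;
        for (int k = 0; k < 8; ++k)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[n] = c;
    }
}
uint32_t Crc32(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)                       /* one table lookup per byte */
        crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}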
Here are some interesting hash functions that can be added to your comparison suite:
I would like to present my latest research:
https://github.com/matteo65/ZedmeeHash - a new hash function with very interesting features
I would like to see this tried with SHA-NI too.