From c7f13689eea151df2baa0f95383a15ebd709a8e7 Mon Sep 17 00:00:00 2001 From: Daniel James Date: Sun, 9 Oct 2011 18:14:50 +0000 Subject: [PATCH] Hash: Note about the quality of the hash function. In response to the thread starting at: http://lists.boost.org/Archives/boost/2011/10/186476.php [SVN r74856] --- hash/doc/hash.qbk | 5 ++++ hash/doc/intro.qbk | 10 +++++--- hash/doc/rationale.qbk | 55 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 67 insertions(+), 3 deletions(-) create mode 100644 hash/doc/rationale.qbk diff --git a/hash/doc/hash.qbk b/hash/doc/hash.qbk index b734fb4..fe428f3 100644 --- a/hash/doc/hash.qbk +++ b/hash/doc/hash.qbk @@ -14,11 +14,16 @@ ] ] +[def __issues__ + [@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf + Library Extension Technical Report Issues List]] + [include:hash intro.qbk] [include:hash tutorial.qbk] [include:hash portability.qbk] [include:hash disable.qbk] [include:hash changes.qbk] +[include:hash rationale.qbk] [xinclude ref.xml] [include:hash links.qbk] [include:hash thanks.qbk] diff --git a/hash/doc/intro.qbk b/hash/doc/intro.qbk index b027719..076e997 100644 --- a/hash/doc/intro.qbk +++ b/hash/doc/intro.qbk @@ -18,9 +18,6 @@ [def __multi-index-short__ [@boost:/libs/multi_index/doc/index.html Boost.MultiIndex]] [def __bimap__ [@boost:/libs/bimap/index.html Boost.Bimap]] -[def __issues__ - [@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf - Library Extension Technical Report Issues List]] [def __hash-function__ [@http://en.wikipedia.org/wiki/Hash_function hash function]] [def __hash-table__ [@http://en.wikipedia.org/wiki/Hash_table hash table]] @@ -44,5 +41,12 @@ __issues__ (page 63), this adds support for: * the standard containers. * extending [classref boost::hash] for custom types. +[note +This hash function is designed to be used in containers based on +the STL and is not suitable as a general purpose hash function. +For more details see the [link hash.rationale rationale]. +] + + [endsect] diff --git a/hash/doc/rationale.qbk b/hash/doc/rationale.qbk new file mode 100644 index 0000000..8323dfa --- /dev/null +++ b/hash/doc/rationale.qbk @@ -0,0 +1,55 @@ + +[/ Copyright 2011 Daniel James. + / Distributed under the Boost Software License, Version 1.0. (See accompanying + / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ] + +[section:rationale Rationale] + +The rationale for the design can be found in the original design +[footnote issue 6.18 of the __issues__ (page 63)], but an issue that +occasionally comes up is the quality of the hash function, so that +demands some more attention. + +Many hash functions strive to have little correlation between the input +and output values. They attempt to uniformally distribute the output +values for very similar inputs. This hash function makes no such +attempt. In fact, for integers, the result of the hash function is often +just the input value. So similar but different input values will result +in similar but different output values. + +This means that it is not appropriate as a general hash function. For +example, a hash table may discard bits from the hash function resulting +in likely collisions, or might have poor collision resolution when hash +values are clustered together. In such cases this hash function will +preform poorly. + +So why not implement a higher quality hash function? Well, the standard +makes no such guarantee, it just requires that the hashes of two +different values are unlikely to collide. So containers or algorithms +designed to work with the standard hash function will have to be +implemented to work well when the hash function's output is correlated +to its input. Since they are paying that cost it would be wasteful to +expand the effort to make a higher quality hash function. + +If you do need a higher quality hash function, there are several options +available. One is to use a second hash on the output of this hash +function, such as [@http://www.concentric.net/~ttwang/tech/inthash.htm +Thomas Wang's hash function]. But for many types this might not work as +well as a hash algorithm tailored for the input. + +For strings that are several fast, high quality hash functions +available, such as: + +* [@http://burtleburtle.net/bob/hash/index.html Bob Jenkins' hash + functions] +* [@http://www.azillionmonkeys.com/qed/hash.html Paul Hsieh's hash + functions] +* [@http://code.google.com/p/cityhash/ Google's CityHash] +* [@http://code.google.com/p/smhasher/ MurmurHash3] + +These may also be appropriate for hashing a binary representation of +your data - providing that all equal values have an equal +representation, which is not always the case (e.g. for floating point +values). + +[endsect] \ No newline at end of file