Hash: Note about the quality of the hash function.

In response to the thread starting at:

http://lists.boost.org/Archives/boost/2011/10/186476.php

[SVN r74856]
This commit is contained in:
Daniel James
2011-10-09 18:14:50 +00:00
parent 572712fac8
commit c7f13689ee
3 changed files with 67 additions and 3 deletions

View File

@@ -14,11 +14,16 @@
]
]
[def __issues__
[@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf
Library Extension Technical Report Issues List]]
[include:hash intro.qbk]
[include:hash tutorial.qbk]
[include:hash portability.qbk]
[include:hash disable.qbk]
[include:hash changes.qbk]
[include:hash rationale.qbk]
[xinclude ref.xml]
[include:hash links.qbk]
[include:hash thanks.qbk]

View File

@@ -18,9 +18,6 @@
[def __multi-index-short__ [@boost:/libs/multi_index/doc/index.html
Boost.MultiIndex]]
[def __bimap__ [@boost:/libs/bimap/index.html Boost.Bimap]]
[def __issues__
[@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf
Library Extension Technical Report Issues List]]
[def __hash-function__ [@http://en.wikipedia.org/wiki/Hash_function hash function]]
[def __hash-table__ [@http://en.wikipedia.org/wiki/Hash_table hash table]]
@@ -44,5 +41,12 @@ __issues__ (page 63), this adds support for:
* the standard containers.
* extending [classref boost::hash] for custom types.
[note
This hash function is designed to be used in containers based on
the STL and is not suitable as a general purpose hash function.
For more details see the [link hash.rationale rationale].
]
[endsect]

55
hash/doc/rationale.qbk Normal file
View File

@@ -0,0 +1,55 @@
[/ Copyright 2011 Daniel James.
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]
[section:rationale Rationale]
The rationale for the design can be found in the original design
[footnote issue 6.18 of the __issues__ (page 63)], but an issue that
occasionally comes up is the quality of the hash function, so that
demands some more attention.
Many hash functions strive to have little correlation between the input
and output values. They attempt to uniformally distribute the output
values for very similar inputs. This hash function makes no such
attempt. In fact, for integers, the result of the hash function is often
just the input value. So similar but different input values will result
in similar but different output values.
This means that it is not appropriate as a general hash function. For
example, a hash table may discard bits from the hash function resulting
in likely collisions, or might have poor collision resolution when hash
values are clustered together. In such cases this hash function will
preform poorly.
So why not implement a higher quality hash function? Well, the standard
makes no such guarantee, it just requires that the hashes of two
different values are unlikely to collide. So containers or algorithms
designed to work with the standard hash function will have to be
implemented to work well when the hash function's output is correlated
to its input. Since they are paying that cost it would be wasteful to
expand the effort to make a higher quality hash function.
If you do need a higher quality hash function, there are several options
available. One is to use a second hash on the output of this hash
function, such as [@http://www.concentric.net/~ttwang/tech/inthash.htm
Thomas Wang's hash function]. But for many types this might not work as
well as a hash algorithm tailored for the input.
For strings that are several fast, high quality hash functions
available, such as:
* [@http://burtleburtle.net/bob/hash/index.html Bob Jenkins' hash
functions]
* [@http://www.azillionmonkeys.com/qed/hash.html Paul Hsieh's hash
functions]
* [@http://code.google.com/p/cityhash/ Google's CityHash]
* [@http://code.google.com/p/smhasher/ MurmurHash3]
These may also be appropriate for hashing a binary representation of
your data - providing that all equal values have an equal
representation, which is not always the case (e.g. for floating point
values).
[endsect]