mirror of
https://github.com/boostorg/container_hash.git
synced 2025-08-04 15:04:39 +02:00
Update Notes section
@@ -39,8 +39,9 @@ Out of the box, `boost::hash` supports
 * `std::variant`, `std::monostate`.

 `boost::hash` is extensible; it's possible for a user-defined type `X` to make
-itself hashable via `boost::hash<X>`. Many, if not most, Boost types already
-contain the necessary support.
+itself hashable via `boost::hash<X>` by defining an appropriate overload of the
+function `hash_value`. Many, if not most, Boost types already contain the
+necessary support.

 `boost::hash` meets the requirements for `std::hash` specified in the {cpp}11
 standard, namely, that for two different input values their corresponding hash
@@ -10,6 +10,150 @@ https://www.boost.org/LICENSE_1_0.txt
 = Design and Implementation Notes
 :idprefix: notes_

+== The hash_value Customization Point
+
+The way one customizes the standard `std::hash` function object for user
+types is via a specialization. `boost::hash` chooses a different mechanism --
+an overload of a free function `hash_value` in the user namespace that is
+found via argument-dependent lookup.
+
+Both approaches have their pros and cons. Specializing the function object
+is stricter in that it only applies to the exact type, and not to derived
+or convertible types. Defining a function, on the other hand, is easier
+and more convenient, as it can be done directly in the type definition as
+an `inline` `friend`.
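The lookup described above can be sketched without Boost. In the example below, `sketch::adl_hash`, `user::point`, and `user::labelled_point` are hypothetical names introduced only for illustration, and `adl_hash` is a reduced stand-in for `boost::hash`, not its actual implementation. The combining step inside `hash_value` is likewise only illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

namespace sketch {

// Hypothetical stand-in for boost::hash, reduced to the lookup step only.
// The call to hash_value is unqualified, so overloads are found through
// argument-dependent lookup in the namespace of T.
template<class T>
struct adl_hash
{
    std::size_t operator()(T const& v) const
    {
        return hash_value(v);
    }
};

} // namespace sketch

namespace user {

class point
{
    int x_;
    int y_;

public:
    point(int x, int y): x_(x), y_(y) {}

    // Customization point defined in the class body as an inline friend.
    // The combining step is illustrative only.
    friend std::size_t hash_value(point const& p)
    {
        std::size_t seed = std::hash<int>()(p.x_);
        seed ^= std::hash<int>()(p.y_) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// A derived type reuses the overload through the derived-to-base conversion,
// which a specialization for the exact type `point` would not cover.
class labelled_point: public point
{
public:
    labelled_point(int x, int y): point(x, y) {}
};

} // namespace user

// Small self-check: both the base and the derived type are hashable.
inline void check_adl_lookup()
{
    user::point p(1, 2);
    user::labelled_point q(1, 2);

    assert(sketch::adl_hash<user::point>()(p) == hash_value(p));
    assert(sketch::adl_hash<user::labelled_point>()(q) == hash_value(p));
}
```

The derived type picking up the base overload is the "less strict" behavior the text mentions: convenient here, but also the reason conversions need care.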
+
+The fact that overloads can be invoked via conversions did cause issues in
+an earlier iteration of the library that defined `hash_value` for all
+integral types separately, including `bool`. Especially under {cpp}03,
+which doesn't have `explicit` conversion operators, some types were
+convertible to `bool` to allow their being tested in e.g. `if` statements,
+which caused them to hash to 0 or 1, rarely what one expects or wants.
+
+This, however, was fixed by declaring the built-in `hash_value` overloads
+to be templates constrained on e.g. `std::is_integral` or its moral
+equivalent. This causes types convertible to an integral to no longer
+match, avoiding the problem.
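A minimal sketch of the conversion problem and of the constrained-template fix follows. The namespaces `old_style` and `constrained`, the type `handle`, and the trait `has_constrained_hash` are hypothetical names made up for this illustration; the library's real overloads are organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace old_style {

// Non-template overload: any type implicitly convertible to bool matches.
inline std::size_t hash_value(bool v)
{
    return static_cast<std::size_t>(v);
}

} // namespace old_style

namespace constrained {

// Template constrained on std::is_integral. T is deduced from the argument
// exactly, so implicit conversions are never considered.
template<class T>
typename std::enable_if<std::is_integral<T>::value, std::size_t>::type
hash_value(T v)
{
    return static_cast<std::size_t>(v);
}

} // namespace constrained

// Hypothetical C++03-style type that is testable in `if` statements
// through an implicit conversion to bool.
struct handle
{
    void* p;
    operator bool() const { return p != 0; }
};

// Detects whether constrained::hash_value accepts a T.
template<class T, class = void>
struct has_constrained_hash: std::false_type {};

template<class T>
struct has_constrained_hash<T,
    decltype(void(constrained::hash_value(std::declval<T const&>())))>:
    std::true_type {};

inline void check_conversion_trap()
{
    int target = 0;
    handle h = { &target };

    // The unconstrained overload silently hashes the handle as `true`.
    assert(old_style::hash_value(h) == 1);

    // The constrained overload accepts integers but rejects the handle.
    static_assert(has_constrained_hash<int>::value, "int is integral");
    static_assert(!has_constrained_hash<handle>::value, "handle must not match");
}
```

With the constrained form, hashing a `handle` becomes a compile-time error until its author supplies a dedicated `hash_value`, instead of quietly producing 0 or 1.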
+
+== hash_combine
+
+The initial implementation of the library was based on Issue 6.18 of the
+http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf[Library Extension Technical Report Issues List]
+(pages 63-67) which proposed the following implementation of `hash_combine`:
+
+[source]
+----
+template<class T>
+void hash_combine(size_t & seed, T const & v)
+{
+    seed ^= hash_value(v) + (seed << 6) + (seed >> 2);
+}
+----
+
+taken from the paper
+"https://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf[Methods for Identifying Versioned and Plagiarised Documents]"
+by Timothy C. Hoad and Justin Zobel.
+
+During the Boost formal review, Dave Harris pointed out that this suffers
+from the so-called "zero trap"; if `seed` is initially 0, and all the
+inputs are 0 (or hash to 0), `seed` remains 0 no matter how many input
+values are combined.
+
+This is an undesirable property, because it causes containers of zeroes
+to have a zero hash value regardless of their sizes.
+
+To fix this, the arbitrary constant `0x9e3779b9` (the golden ratio in a
+32 bit fixed point representation) was added to the computation, yielding
+
+[source]
+----
+template<class T>
+void hash_combine(size_t & seed, T const & v)
+{
+    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+}
+----
+
+This is what shipped in the first release of Boost containing the library.
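The zero trap and its fix can be reproduced in a few lines. The sketch below assumes the inputs have already been hashed to `size_t`, so the two formulas take the hash value directly; the names `combine_original`, `combine_fixed`, and `hash_zeroes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// The originally proposed formula, without the added constant.
inline void combine_original(std::size_t& seed, std::size_t h)
{
    seed ^= h + (seed << 6) + (seed >> 2);
}

// The formula with the constant 0x9e3779b9 added.
inline void combine_fixed(std::size_t& seed, std::size_t h)
{
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hashes a sequence of n zeroes, starting from a zero seed.
inline std::size_t hash_zeroes(void (*combine)(std::size_t&, std::size_t), int n)
{
    std::size_t seed = 0;

    for (int i = 0; i < n; ++i)
    {
        combine(seed, 0);
    }

    return seed;
}

inline void check_zero_trap()
{
    // Original formula: the seed never leaves zero, whatever the length.
    assert(hash_zeroes(combine_original, 1) == 0);
    assert(hash_zeroes(combine_original, 100) == 0);

    // Fixed formula: the first step already yields the constant itself,
    // and a second zero changes the result again.
    assert(hash_zeroes(combine_fixed, 1) == 0x9e3779b9);
    assert(hash_zeroes(combine_fixed, 2) != hash_zeroes(combine_fixed, 1));
}
```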
+
+This function was a reasonable compromise between quality and speed for its
+time, when the input consisted of ``char``s, but it's less suitable for
+combining arbitrary `size_t` inputs.
+
+In Boost 1.56, it was replaced by functions derived from Austin Appleby's
+https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash2.cpp#L57-L62[MurmurHash2 hash function round].
+
+In Boost 1.81, it was changed again -- to the equivalent of
+`mix(seed + 0x9e3779b9 + hash_value(v))`, where `mix(x)` is a high quality
+mixing function that is a bijection over the `size_t` values, of the form
+
+[source]
+----
+x ^= x >> k1;
+x *= m1;
+x ^= x >> k2;
+x *= m2;
+x ^= x >> k3;
+----
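One way to see why a function of this form is a bijection is to write its inverse: each xor-shift step and each multiplication by an odd constant can be undone individually. The sketch below does this for 64 bit values, borrowing the shifts and the multiplier from the 64 bit function the library currently uses (given later in this section). The helper names `unxorshift`, `inverse_odd`, `mix64`, and `unmix64` are hypothetical, and the library itself does not need or ship an inverse.

```cpp
#include <cassert>
#include <cstdint>

// Undoes y = x ^ (x >> k) for 0 < k < 64 by xoring in all shifted copies of y.
inline std::uint64_t unxorshift(std::uint64_t y, unsigned k)
{
    std::uint64_t x = y;

    for (unsigned s = k; s < 64; s += k)
    {
        x ^= y >> s;
    }

    return x;
}

// Multiplicative inverse of an odd m modulo 2^64 by Newton iteration.
// m * m == 1 (mod 8) for odd m, so the start value is correct to 3 bits,
// and every step doubles the number of correct bits (3, 6, 12, 24, 48, 96).
inline std::uint64_t inverse_odd(std::uint64_t m)
{
    std::uint64_t inv = m;

    for (int i = 0; i < 5; ++i)
    {
        inv *= 2 - m * inv;
    }

    return inv;
}

std::uint64_t const mix_multiplier = 0xe9846af9b1a615d; // odd, hence invertible

// Forward mixing function of the xorshift-multiply form.
inline std::uint64_t mix64(std::uint64_t x)
{
    x ^= x >> 32;
    x *= mix_multiplier;
    x ^= x >> 32;
    x *= mix_multiplier;
    x ^= x >> 28;
    return x;
}

// Inverse: the same steps undone in reverse order.
inline std::uint64_t unmix64(std::uint64_t x)
{
    std::uint64_t const inv = inverse_odd(mix_multiplier);

    x = unxorshift(x, 28);
    x *= inv;
    x = unxorshift(x, 32);
    x *= inv;
    x = unxorshift(x, 32);
    return x;
}

inline void check_bijection()
{
    assert(mix_multiplier * inverse_odd(mix_multiplier) == 1);
    assert(unmix64(mix64(0)) == 0);
    assert(unmix64(mix64(1)) == 1);
    assert(unmix64(mix64(0x9e3779b9)) == 0x9e3779b9);
    assert(unmix64(mix64(~std::uint64_t(0))) == ~std::uint64_t(0));
}
```

Because an inverse exists, two distinct inputs can never mix to the same output, so the mixing step by itself loses no information.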
+
+This type of mixing function was originally devised by Austin Appleby as
+the "final mix" part of his MurmurHash3 hash function. He used
+
+[source]
+----
+x ^= x >> 33;
+x *= 0xff51afd7ed558ccd;
+x ^= x >> 33;
+x *= 0xc4ceb9fe1a85ec53;
+x ^= x >> 33;
+----
+
+as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash2.cpp#L57-L62[64 bit function `fmix64`] and
+
+[source]
+----
+x ^= x >> 16;
+x *= 0x85ebca6b;
+x ^= x >> 13;
+x *= 0xc2b2ae35;
+x ^= x >> 16;
+----
+
+as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L68-L77[32 bit function `fmix32`].
+
+Several improvements of the 64 bit function have been subsequently proposed,
+by https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html[David Stafford],
+https://mostlymangling.blogspot.com/2019/12/stronger-better-morer-moremur-better.html[Pelle Evensen],
+and http://jonkagstrom.com/mx3/mx3_rev2.html[Jon Maiga]. We currently use Jon
+Maiga's function
+
+[source]
+----
+x ^= x >> 32;
+x *= 0xe9846af9b1a615d;
+x ^= x >> 32;
+x *= 0xe9846af9b1a615d;
+x ^= x >> 28;
+----
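Put together with the formula `mix(seed + 0x9e3779b9 + hash_value(v))` quoted earlier, the 64 bit variant can be sketched as below. This is an illustration of the described scheme rather than the library's source: the names `mix64` and `combine64` are hypothetical, the seed is fixed to `std::uint64_t` instead of `size_t`, and the input is assumed to be already hashed.

```cpp
#include <cassert>
#include <cstdint>

// The 64 bit mixing function shown above.
inline std::uint64_t mix64(std::uint64_t x)
{
    std::uint64_t const m = 0xe9846af9b1a615d;

    x ^= x >> 32;
    x *= m;
    x ^= x >> 32;
    x *= m;
    x ^= x >> 28;
    return x;
}

// Equivalent of the combining step, taking an already computed hash value h.
inline void combine64(std::uint64_t& seed, std::uint64_t h)
{
    seed = mix64(seed + 0x9e3779b9 + h);
}

inline void check_combine64()
{
    // No zero trap: mix64 maps only 0 to 0, and the added constant
    // keeps the argument away from 0 here.
    std::uint64_t seed = 0;
    combine64(seed, 0);
    assert(seed != 0);

    // A second zero input changes the seed again.
    std::uint64_t const after_one = seed;
    combine64(seed, 0);
    assert(seed != after_one);

    // For a fixed seed, distinct hash values give distinct results,
    // because mix64 is a bijection.
    std::uint64_t a = 17;
    std::uint64_t b = 17;
    combine64(a, 1);
    combine64(b, 2);
    assert(a != b);
}
```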
+
+Under 32 bit, we use a mixing function proposed by "TheIronBorn" in a
+https://github.com/skeeto/hash-prospector/issues/19[Github issue] in
+the https://github.com/skeeto/hash-prospector[repository] of
+https://nullprogram.com/blog/2018/07/31/[Hash Prospector] by Chris Wellons:
+
+[source]
+----
+x ^= x >> 16;
+x *= 0x21f0aaad;
+x ^= x >> 15;
+x *= 0x735a2d97;
+x ^= x >> 15;
+----
+
+With this improved `hash_combine`, `boost::hash` for strings now passes the
+https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
+(for a 64 bit `size_t`).
+
 == Quality of the Hash Function

 Many hash functions strive to have little correlation between the input and
@@ -29,17 +173,3 @@ Containers or algorithms designed to work with the standard hash function will
 have to be implemented to work well when the hash function's output is
 correlated to its input. Since they are paying that cost a higher quality hash
 function would be wasteful.
-
-For other use cases, if you do need a higher quality hash function, then
-neither the standard hash function nor `boost::hash` is appropriate. There are
-several options available. One is to use a second hash on the output of this
-hash function, such as
-http://web.archive.org/web/20121102023700/http://www.concentric.net/~Ttwang/tech/inthash.htm[Thomas Wang's hash function].
-This may not work as well as a hash algorithm tailored for the input.
-
-For strings there are several fast, high quality hash functions available
-(for example http://code.google.com/p/smhasher/[MurmurHash3] and
-http://code.google.com/p/cityhash/[Google's CityHash]), although they tend to
-be more machine specific. These may also be appropriate for hashing a binary
-representation of your data - providing that all equal values have an equal
-representation, which is not always the case (e.g. for floating point values).