From e061b3c4c02b33b26d25401546a042e75919cd44 Mon Sep 17 00:00:00 2001
From: Peter Dimov
Date: Mon, 19 Sep 2022 21:23:52 +0300
Subject: [PATCH] Update Notes section

---
 doc/hash/intro.adoc |   5 +-
 doc/hash/notes.adoc | 158 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 147 insertions(+), 16 deletions(-)

diff --git a/doc/hash/intro.adoc b/doc/hash/intro.adoc
index d63fdee..ae93416 100644
--- a/doc/hash/intro.adoc
+++ b/doc/hash/intro.adoc
@@ -39,8 +39,9 @@ Out of the box, `boost::hash` supports
 * `std::variant`, `std::monostate`.
 
 `boost::hash` is extensible; it's possible for a user-defined type `X` to make
-iself hashable via `boost::hash`. Many, if not most, Boost types already
-contain the necessary support.
+itself hashable via `boost::hash` by defining an appropriate overload of the
+function `hash_value`. Many, if not most, Boost types already contain the
+necessary support.
 
 `boost::hash` meets the requirements for `std::hash` specified in the {cpp}11
 standard, namely, that for two different input values their corresponding hash
diff --git a/doc/hash/notes.adoc b/doc/hash/notes.adoc
index d7daaf3..5c2f218 100644
--- a/doc/hash/notes.adoc
+++ b/doc/hash/notes.adoc
@@ -10,6 +10,150 @@ https://www.boost.org/LICENSE_1_0.txt
 = Design and Implementation Notes
 :idprefix: notes_
 
+== The hash_value Customization Point
+
+The way one customizes the standard `std::hash` function object for user
+types is via a specialization. `boost::hash` chooses a different mechanism --
+an overload of a free function `hash_value` in the user namespace that is
+found via argument-dependent lookup.
+
+Both approaches have their pros and cons. Specializing the function object
+is stricter in that it only applies to the exact type, and not to derived
+or convertible types. Defining a function, on the other hand, is easier
+and more convenient, as it can be done directly in the type definition as
+an `inline` `friend`.
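As an illustration of the `inline` `friend` approach described above, the following is a minimal sketch and not part of the patch. The type `demo::point` and the function object `adl_hash` are hypothetical names; `adl_hash` is only a stand-in for `boost::hash` so that the example compiles without Boost, and the way the two members are combined is a placeholder rather than a recommended hash.

```cpp
#include <cstddef>
#include <functional>

namespace demo {

// Hypothetical user type. hash_value is defined inside the class as an
// inline friend, so it can only be found via argument-dependent lookup.
class point
{
    int x_;
    int y_;

public:
    point(int x, int y): x_(x), y_(y) {}

    friend bool operator==(point const& a, point const& b)
    {
        return a.x_ == b.x_ && a.y_ == b.y_;
    }

    friend std::size_t hash_value(point const& p)
    {
        // Placeholder combination of the members, for illustration only.
        return std::hash<int>()(p.x_) * 31 + std::hash<int>()(p.y_);
    }
};

} // namespace demo

// Stand-in for boost::hash (an assumption made to keep the sketch
// self-contained): it forwards to an unqualified hash_value call, which
// is resolved through ADL when the template is instantiated.
template<class T> struct adl_hash
{
    std::size_t operator()(T const& v) const
    {
        return hash_value(v);
    }
};
```

With Boost available, `boost::hash<demo::point>` would be expected to pick up the same `hash_value` overload; equal `point` values then produce equal hash values because the overload depends only on the members compared by `operator==`.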
+
+The fact that overloads can be invoked via conversions did cause issues in
+an earlier iteration of the library that defined `hash_value` for all
+integral types separately, including `bool`. Especially under {cpp}03,
+which doesn't have `explicit` conversion operators, some types were
+convertible to `bool` to allow their being tested in e.g. `if` statements,
+which caused them to hash to 0 or 1, rarely what one expects or wants.
+
+This, however, was fixed by declaring the built-in `hash_value` overloads
+to be templates constrained on e.g. `std::is_integral` or its moral
+equivalent. This causes types convertible to an integral type to no longer
+match, avoiding the problem.
+
+== hash_combine
+
+The initial implementation of the library was based on Issue 6.18 of the
+http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf[Library Extension Technical Report Issues List]
+(pages 63-67) which proposed the following implementation of `hash_combine`:
+
+[source]
+----
+template<class T>
+void hash_combine(size_t & seed, T const & v)
+{
+    seed ^= hash_value(v) + (seed << 6) + (seed >> 2);
+}
+----
+
+taken from the paper
+"https://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf[Methods for Identifying Versioned and Plagiarised Documents]"
+by Timothy C. Hoad and Justin Zobel.
+
+During the Boost formal review, Dave Harris pointed out that this suffers
+from the so-called "zero trap"; if `seed` is initially 0, and all the
+inputs are 0 (or hash to 0), `seed` remains 0 no matter how many input
+values are combined.
+
+This is an undesirable property, because it causes containers of zeroes
+to have a zero hash value regardless of their sizes.
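The zero trap can be reproduced with a small sketch that is not part of the patch. The helper names `combine_original` and `hash_sequence` are hypothetical, and the parameter `h` stands in for the result of `hash_value(v)` so that the example does not depend on Boost.

```cpp
#include <cstddef>
#include <vector>

// The originally proposed combining step, written over a precomputed
// element hash h (h plays the role of hash_value(v)).
inline void combine_original(std::size_t& seed, std::size_t h)
{
    seed ^= h + (seed << 6) + (seed >> 2);
}

// Hash of a sequence of already-hashed elements, starting from seed 0.
inline std::size_t hash_sequence(std::vector<std::size_t> const& hashes)
{
    std::size_t seed = 0;

    for (std::size_t h : hashes)
    {
        combine_original(seed, h);
    }

    return seed;
}
```

With this sketch, `hash_sequence` returns 0 for a vector holding a single zero and for a vector holding a thousand zeroes alike, since every step computes `0 ^ (0 + 0 + 0)`.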
+
+To fix this, the arbitrary constant `0x9e3779b9` (the golden ratio in a
+32 bit fixed point representation) was added to the computation, yielding
+
+[source]
+----
+template<class T>
+void hash_combine(size_t & seed, T const & v)
+{
+    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
+}
+----
+
+This is what shipped in the first release of Boost containing the library.
+
+This function was a reasonable compromise between quality and speed for its
+time, when the input consisted of ``char``s, but it's less suitable for
+combining arbitrary `size_t` inputs.
+
+In Boost 1.56, it was replaced by functions derived from Austin Appleby's
+https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash2.cpp#L57-L62[MurmurHash2 hash function round].
+
+In Boost 1.81, it was changed again -- to the equivalent of
+`mix(seed + 0x9e3779b9 + hash_value(v))`, where `mix(x)` is a high quality
+mixing function that is a bijection over the `size_t` values, of the form
+
+[source]
+----
+x ^= x >> k1;
+x *= m1;
+x ^= x >> k2;
+x *= m2;
+x ^= x >> k3;
+----
+
+This type of mixing function was originally devised by Austin Appleby as
+the "final mix" part of his MurmurHash3 hash function. He used
+
+[source]
+----
+x ^= x >> 33;
+x *= 0xff51afd7ed558ccd;
+x ^= x >> 33;
+x *= 0xc4ceb9fe1a85ec53;
+x ^= x >> 33;
+----
+
+as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L81-L90[64 bit function `fmix64`] and
+
+[source]
+----
+x ^= x >> 16;
+x *= 0x85ebca6b;
+x ^= x >> 13;
+x *= 0xc2b2ae35;
+x ^= x >> 16;
+----
+
+as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L68-L77[32 bit function `fmix32`].
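The mixing steps quoted above can be wrapped into complete functions, as in the following sketch, which is not part of the patch. `fmix64` and `fmix32` use Appleby's constants exactly as listed; `combine_sketch` is a hypothetical helper that plugs `fmix64` into the `mix(seed + 0x9e3779b9 + hash_value(v))` form purely for illustration, with `h` standing in for `hash_value(v)`. It should not be read as the library's actual implementation or choice of constants.

```cpp
#include <cstdint>

// MurmurHash3 64 bit finalizer, using the constants quoted above.
inline std::uint64_t fmix64(std::uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

// MurmurHash3 32 bit finalizer, using the constants quoted above.
inline std::uint32_t fmix32(std::uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

// Illustration only: the mix(seed + 0x9e3779b9 + h) form, with fmix64
// chosen as the mixing function for the sake of the example.
inline std::uint64_t combine_sketch(std::uint64_t seed, std::uint64_t h)
{
    return fmix64(seed + 0x9e3779b9u + h);
}
```

Each step (xor with a right shift, multiplication by an odd constant) is invertible, so these functions are bijections that map 0 to 0. In this sketch, combining a zero into a zero seed therefore yields a nonzero value, and combining a further zero yields a different value again, which is how this form avoids the zero trap.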
+
+Several improvements of the 64 bit function have been subsequently proposed,
+by https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html[David Stafford],
+https://mostlymangling.blogspot.com/2019/12/stronger-better-morer-moremur-better.html[Pelle Evensen],
+and http://jonkagstrom.com/mx3/mx3_rev2.html[Jon Maiga]. We currently use Jon
+Maiga's function
+
+[source]
+----
+x ^= x >> 32;
+x *= 0xe9846af9b1a615d;
+x ^= x >> 32;
+x *= 0xe9846af9b1a615d;
+x ^= x >> 28;
+----
+
+Under 32 bit, we use a mixing function proposed by "TheIronBorn" in a
+https://github.com/skeeto/hash-prospector/issues/19[GitHub issue] in
+the https://github.com/skeeto/hash-prospector[repository] of
+https://nullprogram.com/blog/2018/07/31/[Hash Prospector] by Chris Wellons:
+
+[source]
+----
+x ^= x >> 16;
+x *= 0x21f0aaad;
+x ^= x >> 15;
+x *= 0x735a2d97;
+x ^= x >> 15;
+----
+
+With this improved `hash_combine`, `boost::hash` for strings now passes the
+https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
+(for a 64 bit `size_t`).
+
 == Quality of the Hash Function
 
 Many hash functions strive to have little correlation between the input and
@@ -29,17 +173,3 @@ Containers or algorithms designed to work with the standard hash function will
 have to be implemented to work well when the hash function's output is
 correlated to its input. Since they are paying that cost a higher quality hash
 function would be wasteful.
-
-For other use cases, if you do need a higher quality hash function, then
-neither the standard hash function or `boost::hash` are appropriate. There are
-several options available. One is to use a second hash on the output of this
-hash function, such as
-http://web.archive.org/web/20121102023700/http://www.concentric.net/~Ttwang/tech/inthash.htm[Thomas Wang's hash function].
-This this may not work as well as a hash algorithm tailored for the input.
-
-For strings there are several fast, high quality hash functions available
-(for example http://code.google.com/p/smhasher/[MurmurHash3] and
-http://code.google.com/p/cityhash/[Google's CityHash]), although they tend to
-be more machine specific. These may also be appropriate for hashing a binary
-representation of your data - providing that all equal values have an equal
-representation, which is not always the case (e.g. for floating point values).