Update Notes section

2025-09-26 16:40:53 +02:00 · 2022-09-19 21:23:52 +03:00
parent 9035aa5485
commit e061b3c4c0
2 changed files with 147 additions and 16 deletions
--- a/doc/hash/intro.adoc
+++ b/doc/hash/intro.adoc
@@ -39,8 +39,9 @@ Out of the box, `boost::hash` supports
 * `std::variant`, `std::monostate`.
 `boost::hash` is extensible; it's possible for a user-defined type `X` to make
-iself hashable via `boost::hash<X>`. Many, if not most, Boost types already
+itself hashable via `boost::hash<X>` by defining an appropriate overload of the
-contain the necessary support.
+function `hash_value`. Many, if not most, Boost types already contain the
 necessary support.
 `boost::hash` meets the requirements for `std::hash` specified in the {cpp}11
 standard, namely, that for two different input values their corresponding hash
--- a/doc/hash/notes.adoc
+++ b/doc/hash/notes.adoc
@@ -10,6 +10,150 @@ https://www.boost.org/LICENSE_1_0.txt
 = Design and Implementation Notes
 :idprefix: notes_
 == The hash_value Customization Point
 The way one customizes the standard `std::hash` function object for user
 types is via a specialization. `boost::hash` chooses a different mechanism --
 an overload of a free function `hash_value` in the user namespace that is
 found via argument-dependent lookup.
 Both approaches have their pros and cons. Specializing the function object
 is stricter in that it only applies to the exact type, and not to derived
 or convertible types. Defining a function, on the other hand, is easier
 and more convenient, as it can be done directly in the type definition as
 an `inline` `friend`.
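As a minimal sketch of the mechanism described above, the following defines `hash_value` as an `inline` `friend` and calls it through argument-dependent lookup. The names `geometry::point` and `adl_hash` are hypothetical; `adl_hash` is only a stand-in for a function object that makes an unqualified call, and is not how `boost::hash` itself is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

namespace geometry {

struct point {
    int x;
    int y;

    // Defined inside the class as a friend: no separate declaration is
    // needed, and the function is found only via argument-dependent lookup.
    friend std::size_t hash_value(point const& p) {
        std::size_t seed = std::hash<int>()(p.x);
        // Illustrative combining step; any deterministic combination works here.
        seed ^= std::hash<int>()(p.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

} // namespace geometry

// Hypothetical stand-in for a hash function object that dispatches to the
// user's overload. The call is unqualified, so ADL searches namespace geometry.
template<class T>
struct adl_hash {
    std::size_t operator()(T const& v) const {
        return hash_value(v);
    }
};
```

Because the overload lives next to the type, no specialization in another namespace is required; equal `point` values produce equal hashes through `adl_hash<geometry::point>`.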
 The fact that overloads can be invoked via conversions did cause issues in
 an earlier iteration of the library that defined `hash_value` for all
 integral types separately, including `bool`. Especially under {cpp}03,
 which doesn't have `explicit` conversion operators, some types were
 convertible to `bool` to allow their being tested in e.g. `if` statements,
 which caused them to hash to 0 or 1, rarely what one expects or wants.
 This, however, was fixed by declaring the built-in `hash_value` overloads
 to be templates constrained on e.g. `std::is_integral` or its moral
 equivalent. This causes types convertible to an integral to no longer
 match, avoiding the problem.
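The difference between the two styles can be sketched as follows. The namespaces `unconstrained` and `constrained`, the type `handle`, and the trait `constrained_accepts` are all hypothetical names chosen for illustration; the library's actual overload set is larger and may be constrained differently.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

namespace unconstrained {
    // Plain overload: anything implicitly convertible to bool will match.
    inline std::size_t hash_value(bool v) { return v ? 1 : 0; }
}

namespace constrained {
    // Template overload: T is deduced from the argument, so no conversion
    // is considered, and non-integral types are rejected by SFINAE.
    template<class T>
    typename std::enable_if<std::is_integral<T>::value, std::size_t>::type
    hash_value(T v) { return static_cast<std::size_t>(v); }
}

// A C++03-style type that is testable in `if` statements.
struct handle {
    void* p;
    operator bool() const { return p != 0; }
};

// Compiles, but silently hashes every non-null handle to 1.
inline std::size_t hash_unconstrained(handle const& h) {
    return unconstrained::hash_value(h);
}

// Detects whether constrained::hash_value is callable with a T.
template<class T, class = void>
struct constrained_accepts : std::false_type {};

template<class T>
struct constrained_accepts<
    T, decltype(void(constrained::hash_value(std::declval<T const&>())))>
    : std::true_type {};
```

With the constrained form, `constrained_accepts<int>::value` is `true` while `constrained_accepts<handle>::value` is `false`, so a `handle` no longer hashes by accident.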
 == hash_combine
 The initial implementation of the library was based on Issue 6.18 of the
 http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1837.pdf[Library Extension Technical Report Issues List]
 (pages 63-67) which proposed the following implementation of `hash_combine`:
 [source]
 ----
 template<class T>
 void hash_combine(size_t & seed, T const & v)
 {
    seed ^= hash_value(v) + (seed << 6) + (seed >> 2);
 }
 ----
 taken from the paper
 "https://people.eng.unimelb.edu.au/jzobel/fulltext/jasist03thz.pdf[Methods for Identifying Versioned and Plagiarised Documents]"
 by Timothy C. Hoad and Justin Zobel.
 During the Boost formal review, Dave Harris pointed out that this suffers
 from the so-called "zero trap"; if `seed` is initially 0, and all the
 inputs are 0 (or hash to 0), `seed` remains 0 no matter how many input
 values are combined.
 This is an undesirable property, because it causes containers of zeroes
 to have a zero hash value regardless of their sizes.
 To fix this, the arbitrary constant `0x9e3779b9` (the fractional part of the
 golden ratio in a 32 bit fixed point representation) was added to the
 computation, yielding
 [source]
 ----
 template<class T>
 void hash_combine(size_t & seed, T const & v)
 {
    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
 }
 ----
 This is what shipped in the first release of Boost containing the library.
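The zero trap and its fix can be reproduced with a small sketch. To keep it self-contained, the helpers below (hypothetical names) take an already-computed hash `h` instead of calling `hash_value(v)`, and combine `n` inputs that all hash to 0.

```cpp
#include <cassert>
#include <cstddef>

// Combining step as originally proposed, without the added constant.
inline void combine_original(std::size_t& seed, std::size_t h) {
    seed ^= h + (seed << 6) + (seed >> 2);
}

// Combining step with the constant added.
inline void combine_fixed(std::size_t& seed, std::size_t h) {
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash of a sequence of n elements that each hash to 0, starting from seed 0.
inline std::size_t hash_zeroes_original(std::size_t n) {
    std::size_t seed = 0;
    for (std::size_t i = 0; i < n; ++i) combine_original(seed, 0);
    return seed;
}

inline std::size_t hash_zeroes_fixed(std::size_t n) {
    std::size_t seed = 0;
    for (std::size_t i = 0; i < n; ++i) combine_fixed(seed, 0);
    return seed;
}
```

With the original step the result is 0 for every `n`; with the constant, one zero input already yields `0x9e3779b9`, and sequences of one and two zeroes hash differently.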
 This function was a reasonable compromise between quality and speed for its
 time, when the input consisted of ``char``s, but it's less suitable for
 combining arbitrary `size_t` inputs.
 In Boost 1.56, it was replaced by functions derived from Austin Appleby's
 https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash2.cpp#L57-L62[MurmurHash2 hash function round].
 In Boost 1.81, it was changed again -- to the equivalent of
 `mix(seed + 0x9e3779b9 + hash_value(v))`, where `mix(x)` is a high quality
 mixing function that is a bijection over the `size_t` values, of the form
 [source]
 ----
 x ^= x >> k1;
 x *= m1;
 x ^= x >> k2;
 x *= m2;
 x ^= x >> k3;
 ----
 This type of mixing function was originally devised by Austin Appleby as
 the "final mix" part of his MurmurHash3 hash function. He used
 [source]
 ----
 x ^= x >> 33;
 x *= 0xff51afd7ed558ccd;
 x ^= x >> 33;
 x *= 0xc4ceb9fe1a85ec53;
 x ^= x >> 33;
 ----
 as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L81-L90[64 bit function `fmix64`] and
 [source]
 ----
 x ^= x >> 16;
 x *= 0x85ebca6b;
 x ^= x >> 13;
 x *= 0xc2b2ae35;
 x ^= x >> 16;
 ----
 as the https://github.com/aappleby/smhasher/blob/61a0530f28277f2e850bfc39600ce61d02b518de/src/MurmurHash3.cpp#L68-L77[32 bit function `fmix32`].
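Mixing functions of this form are bijections because each step can be undone: an xor with a right shift is invertible, and multiplication by an odd constant is invertible modulo a power of two. The following sketch illustrates this for `fmix32` by constructing its inverse; the helper names `undo_xorshift`, `inverse_odd` and `unfmix32` are hypothetical and are not part of MurmurHash3 or Boost.

```cpp
#include <cassert>
#include <cstdint>

inline std::uint32_t fmix32(std::uint32_t x) {
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

// Undo `x ^= x >> k` for 0 < k < 32. The top k bits of y are already
// correct; each pass recovers k more bits below them.
inline std::uint32_t undo_xorshift(std::uint32_t y, int k) {
    std::uint32_t x = y;
    for (int i = k; i < 32; i += k) x = y ^ (x >> k);
    return x;
}

// Multiplicative inverse of an odd m modulo 2^32 by Newton iteration.
// m * m == 1 (mod 8), so m is a 3 bit inverse; each step doubles the bits.
inline std::uint32_t inverse_odd(std::uint32_t m) {
    std::uint32_t inv = m;
    for (int i = 0; i < 5; ++i) inv *= 2u - m * inv;
    return inv;
}

// Apply the inverse of each fmix32 step in reverse order.
inline std::uint32_t unfmix32(std::uint32_t x) {
    x = undo_xorshift(x, 16);
    x *= inverse_odd(0xc2b2ae35u);
    x = undo_xorshift(x, 13);
    x *= inverse_odd(0x85ebca6bu);
    x = undo_xorshift(x, 16);
    return x;
}
```

Since `unfmix32(fmix32(x)) == x` for every `x`, no two inputs can map to the same output. One consequence is that 0 is the only input that mixes to 0.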
 Several improvements of the 64 bit function have been subsequently proposed,
 by https://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html[David Stafford],
 https://mostlymangling.blogspot.com/2019/12/stronger-better-morer-moremur-better.html[Pelle Evensen],
 and http://jonkagstrom.com/mx3/mx3_rev2.html[Jon Maiga]. We currently use Jon
 Maiga's function
 [source]
 ----
 x ^= x >> 32;
 x *= 0xe9846af9b1a615d;
 x ^= x >> 32;
 x *= 0xe9846af9b1a615d;
 x ^= x >> 28;
 ----
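Putting the pieces together, the 64 bit combining step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library source: it assumes a 64 bit `size_t` (modelled here as `std::uint64_t`), takes the already-computed hash `h` in place of `hash_value(v)`, and uses the hypothetical names `mix64` and `hash_combine_sketch`.

```cpp
#include <cassert>
#include <cstdint>

// The 64 bit mixing function quoted above.
inline std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 32;
    x *= 0xe9846af9b1a615dULL;
    x ^= x >> 32;
    x *= 0xe9846af9b1a615dULL;
    x ^= x >> 28;
    return x;
}

// Equivalent of mix(seed + 0x9e3779b9 + hash_value(v)), with h standing in
// for hash_value(v).
inline void hash_combine_sketch(std::uint64_t& seed, std::uint64_t h) {
    seed = mix64(seed + 0x9e3779b9 + h);
}
```

Because `mix64` is a bijection that maps 0 to 0, combining a single zero hash into a zero seed gives a nonzero result, and combining a second zero changes the seed again, so the zero trap does not reappear.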
 Under 32 bit, we use a mixing function proposed by "TheIronBorn" in a
 https://github.com/skeeto/hash-prospector/issues/19[GitHub issue] in
 the https://github.com/skeeto/hash-prospector[repository] of
 https://nullprogram.com/blog/2018/07/31/[Hash Prospector] by Chris Wellons:
 [source]
 ----
 x ^= x >> 16;
 x *= 0x21f0aaad;
 x ^= x >> 15;
 x *= 0x735a2d97;
 x ^= x >> 15;
 ----
 With this improved `hash_combine`, `boost::hash` for strings now passes the
 https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
 (for a 64 bit `size_t`).
 == Quality of the Hash Function
 Many hash functions strive to have little correlation between the input and
@@ -29,17 +173,3 @@ Containers or algorithms designed to work with the standard hash function will
 have to be implemented to work well when the hash function's output is
 correlated to its input. Since they are paying that cost a higher quality hash
 function would be wasteful.
 For other use cases, if you do need a higher quality hash function, then
 neither the standard hash function nor `boost::hash` is appropriate. There are
 several options available. One is to use a second hash on the output of this
 hash function, such as
 http://web.archive.org/web/20121102023700/http://www.concentric.net/~Ttwang/tech/inthash.htm[Thomas Wang's hash function].
 This may not work as well as a hash algorithm tailored for the input.
 For strings there are several fast, high quality hash functions available
 (for example http://code.google.com/p/smhasher/[MurmurHash3] and
 http://code.google.com/p/cityhash/[Google's CityHash]), although they tend to
 be more machine specific. These may also be appropriate for hashing a binary
 representation of your data - providing that all equal values have an equal
 representation, which is not always the case (e.g. for floating point values).
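The floating point caveat can be illustrated with positive and negative zero, assuming an IEEE 754 `double`. The helpers `same_bytes` and `normalize_zero` are hypothetical names; the sketch only shows why hashing the raw bytes is unsafe, not how any particular library handles it.

```cpp
#include <cassert>
#include <cstring>

// True if the two doubles have identical object representations.
inline bool same_bytes(double a, double b) {
    return std::memcmp(&a, &b, sizeof(double)) == 0;
}

// Maps both zeroes to +0.0 so that equal values share one representation
// before their bytes are hashed.
inline double normalize_zero(double v) {
    return v == 0.0 ? 0.0 : v;
}
```

Here `0.0 == -0.0` holds, yet `same_bytes(0.0, -0.0)` is `false` because the sign bit differs, so a byte-wise hash would give equal values different hashes unless they are normalized first.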