added section on hash quality, avalanching and stats

2025-09-25 15:20:57 +02:00 · 2024-05-07 20:13:43 +02:00
parent a527745ff8
commit d46e83296c
2 changed files with 157 additions and 0 deletions
--- a/doc/unordered.adoc
+++ b/doc/unordered.adoc
@@ -15,6 +15,7 @@ include::unordered/buckets.adoc[]
 include::unordered/hash_equality.adoc[]
 include::unordered/regular.adoc[]
 include::unordered/concurrent.adoc[]
+include::unordered/hash_quality.adoc[]
 include::unordered/compliance.adoc[]
 include::unordered/structures.adoc[]
 include::unordered/benchmarks.adoc[]
--- a/doc/unordered/hash_quality.adoc
+++ b/doc/unordered/hash_quality.adoc
@@ -0,0 +1,156 @@
+[#hash_quality]
+= Hash Quality
+
+:idprefix: hash_quality_
+
+In order to work properly, hash tables require that the supplied hash function
+be of __good quality__, roughly meaning that it uses its `std::size_t` output
+space as uniformly as possible, much like a random number generator would do,
+except, of course, that the value of a hash function is not random but strictly determined
+by its input argument.
+
+Closed-addressing containers in Boost.Unordered are fairly robust against
+hash functions with less-than-ideal quality, but open-addressing and concurrent
+containers are much more sensitive to this factor, and their performance can
+degrade dramatically if the hash function is not appropriate. In general, if
+you're using functions provided by or generated with link:../../../container_hash/index.html[Boost.Hash^],
+the quality will be adequate, but you have to be careful when using alternative
+hash algorithms.
+
+The rest of this section applies only to open-addressing and concurrent containers.
+
+== Hash Post-mixing and the Avalanching Property
+
+Even if your supplied hash function is of bad quality, chances are that
+the performance of Boost.Unordered containers will be acceptable, because the library
+executes an internal __post-mixing__ step that improves the statistical
+properties of the calculated hash values. This comes with an extra computational
+cost: if you'd like to opt out of post-mixing, annotate your hash function as
+follows:
+
+[source,c++]
+----
+struct my_string_hash_function
+{
+  using is_avalanching = void; // instruct Boost.Unordered to not use post-mixing
+
+  std::size_t operator()(const std::string& x) const
+  {
+    ...
+  }
+};
+----
+
+By setting the
+xref:#hash_traits_hash_is_avalanching[hash_is_avalanching] trait, we inform Boost.Unordered
+that `my_string_hash_function` is of sufficient quality to be used directly without
+any post-mixing safety net. This comes at the risk of degraded performance in the
+cases where the hash function is not as well-behaved as we've declared.
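+
+Informally, a hash function avalanches when flipping a single input bit changes
+each output bit with probability close to 50%. The following sketch estimates
+this for 64-bit integer inputs. It is not part of Boost.Unordered:
+`identity_hash`, `mixed_hash` and `average_flipped_bits` are hypothetical names,
+and the MurmurHash3 64-bit finalizer is used only as an example of a well-mixing
+function, not as a description of the library's post-mixing step.
+
```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <random>

// Identity "hash": flipping one input bit flips exactly one output bit.
inline std::uint64_t identity_hash(std::uint64_t x) { return x; }

// 64-bit finalizer from MurmurHash3 (fmix64), a well-known bit mixer.
inline std::uint64_t mixed_hash(std::uint64_t x)
{
  x ^= x >> 33;
  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33;
  x *= 0xc4ceb9fe1a85ec53ULL;
  x ^= x >> 33;
  return x;
}

// Average number of output bits that change when a single input bit is
// flipped, taken over `samples` pseudo-random inputs and all 64 bit positions.
// An avalanching 64-bit hash should score close to 32.
template<typename Hash>
double average_flipped_bits(Hash h, int samples = 1000)
{
  assert(samples > 0);
  std::mt19937_64 rng(12345); // fixed seed so that runs are repeatable
  std::uint64_t total = 0;
  for (int i = 0; i < samples; ++i) {
    const std::uint64_t x  = rng();
    const std::uint64_t hx = h(x);
    for (int bit = 0; bit < 64; ++bit) {
      const std::uint64_t diff = hx ^ h(x ^ (std::uint64_t(1) << bit));
      total += std::bitset<64>(diff).count();
    }
  }
  return static_cast<double>(total) / (64.0 * samples);
}
```
+
+A function that scores close to 32 flipped bits out of 64 is a plausible
+candidate for `is_avalanching`; one that scores near 1, like the identity,
+is better left to the post-mixing step.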
+
+== Container Statistics
+
+If we globally define the macro `BOOST_UNORDERED_ENABLE_STATS`, open-addressing and
+concurrent containers will calculate some internal statistics directly correlated to the
+quality of the hash function:
+
+[source,c++]
+----
+#define BOOST_UNORDERED_ENABLE_STATS
+#include <boost/unordered/unordered_flat_map.hpp>
+
+...
+
+int main()
+{
+  boost::unordered_flat_map<std::string, int, my_string_hash_function> m;
+  ... // use m
+
+  auto stats = m.get_stats();
+  ... // inspect stats
+}
+----
+
+The `stats` object provides the following information:
+
+[%noheader, cols="1,1,1,1,~", frame=all, grid=rows]
+|===
+|`stats`||||
+
+||`.insertion`|||**Insertion operations**
+
+|||`.count`||Number of operations
+
+|||`.probe_length`||Probe length per operation
+
+||||`.average` +
+`.variance` +
+`.deviation`|
+
+||`.successful_lookup`|||**Lookup operations (element found)**
+
+|||`.count`||Number of operations
+
+|||`.probe_length`||Probe length per operation
+
+||||`.average` +
+`.variance` +
+`.deviation`|
+
+|||`.num_comparisons`||Elements compared to the key per operation
+
+||||`.average` +
+`.variance` +
+`.deviation`|
+
+||`.unsuccessful_lookup`|||**Lookup operations (element not found)**
+
+|||`.count`||Number of operations
+
+|||`.probe_length`||Probe length per operation
+
+||||`.average` +
+`.variance` +
+`.deviation`|
+
+|||`.num_comparisons`||Elements compared to the key per operation
+
+||||`.average` +
+`.variance` +
+`.deviation`|
+|===
+
+Statistics for three internal operations are maintained: insertions (without considering
+the previous lookup to determine that the key is not present yet), successful lookups
+and unsuccessful lookups. _Probe length_ is the number of
+xref:#structures_open_addressing_containers[bucket groups] accessed per operation.
+If the hash function has good quality:
+
+* Average probe lengths should be close to 1.0.
+* The average number of comparisons per successful lookup should be close to 1.0 (that is,
+just the element found is checked).
+* The average number of comparisons per unsuccessful lookup should be close to 0.0. 
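+
+The checks listed above can be sketched as a small helper. This is only an
+illustration: the tolerances are arbitrary assumptions rather than values
+recommended by the library, and `mock_stats` is a hypothetical stand-in that
+mirrors the member layout of the table above so that the snippet compiles
+without Boost. With the library, the argument would be the object returned
+by `get_stats()`.
+
```cpp
#include <cstddef>

// Hypothetical stand-ins mirroring the member layout documented above.
struct mock_summary   { double average, variance, deviation; };
struct mock_insertion { std::size_t count; mock_summary probe_length; };
struct mock_lookup
{
  std::size_t  count;
  mock_summary probe_length;
  mock_summary num_comparisons;
};
struct mock_stats
{
  mock_insertion insertion;
  mock_lookup    successful_lookup;
  mock_lookup    unsuccessful_lookup;
};

// Returns true if every average is near its ideal value.
// The tolerances below are arbitrary, illustrative assumptions.
template<typename Stats>
constexpr bool looks_healthy(const Stats& s)
{
  constexpr double max_probe_length             = 1.5; // ideal: 1.0
  constexpr double max_successful_comparisons   = 1.5; // ideal: 1.0
  constexpr double max_unsuccessful_comparisons = 0.5; // ideal: 0.0

  return s.insertion.probe_length.average              <= max_probe_length
      && s.successful_lookup.probe_length.average      <= max_probe_length
      && s.unsuccessful_lookup.probe_length.average    <= max_probe_length
      && s.successful_lookup.num_comparisons.average   <= max_successful_comparisons
      && s.unsuccessful_lookup.num_comparisons.average <= max_unsuccessful_comparisons;
}
```
+
+With `BOOST_UNORDERED_ENABLE_STATS` defined, a call such as
+`looks_healthy(m.get_stats())` after exercising `m` gives a quick pass/fail
+signal; when it fails, the individual averages are still worth inspecting.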
+
+An link:../../benchmark/string_stats.cpp[example^] is provided that displays container
+statistics for `boost::hash<std::string>`, an implementation of the
+https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash[FNV-1a hash^]
+and two ill-behaved custom hash functions that have been incorrectly marked as avalanching:
+
+[listing]
+----
+                   boost::unordered_flat_map:   319 ms
+                                   insertion: probe length 1.08771
+                           successful lookup: probe length 1.06206, num comparisons 1.02121
+                         unsuccessful lookup: probe length 1.12301, num comparisons 0.0388251
+           boost::unordered_flat_map, FNV-1a:   301 ms
+                                   insertion: probe length 1.09567
+                           successful lookup: probe length 1.06202, num comparisons 1.0227
+                         unsuccessful lookup: probe length 1.12195, num comparisons 0.040527
+boost::unordered_flat_map, slightly_bad_hash:   654 ms
+                                   insertion: probe length 1.03443
+                           successful lookup: probe length 1.04137, num comparisons 6.22152
+                         unsuccessful lookup: probe length 1.29334, num comparisons 11.0335
+         boost::unordered_flat_map, bad_hash: 12216 ms
+                                   insertion: probe length 699.218
+                           successful lookup: probe length 590.183, num comparisons 43.4886
+                         unsuccessful lookup: probe length 1361.65, num comparisons 75.238
+----
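+
+For reference, the FNV-1a algorithm mentioned above can be sketched as follows.
+This is a minimal 64-bit version written for illustration and is not taken from
+the benchmark source; `fnv1a_64` and `fnv1a_string_hash` are hypothetical names.
+The sketch leaves out `is_avalanching`, so a container using it would keep
+post-mixing enabled.
+
```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// 64-bit FNV-1a over a byte range: XOR each byte into the state, then
// multiply by the FNV prime.
constexpr std::uint64_t fnv1a_64(const char* data, std::size_t size)
{
  std::uint64_t h = 0xcbf29ce484222325ULL;      // FNV offset basis
  for (std::size_t i = 0; i < size; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 0x100000001b3ULL;                       // FNV prime
  }
  return h;
}

// Published FNV-1a test vector for the one-byte input "a".
static_assert(fnv1a_64("a", 1) == 0xaf63dc4c8601ec8cULL, "FNV-1a test vector");

// Hash function object usable as the Hash parameter of a container.
// It deliberately does not declare is_avalanching.
struct fnv1a_string_hash
{
  std::size_t operator()(const std::string& x) const
  {
    return static_cast<std::size_t>(fnv1a_64(x.data(), x.size()));
  }
};
```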