From f7e537d1a1ac8030e0c00b4a4e2e36abcc88f4e6 Mon Sep 17 00:00:00 2001
From: Peter Dimov <pdimov@gmail.com>
Date: Tue, 20 Sep 2022 21:06:09 +0300
Subject: [PATCH] Update Notes section

---
 doc/hash/notes.adoc | 77 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 17 deletions(-)
diff --git a/doc/hash/notes.adoc b/doc/hash/notes.adoc
index 6cbde88..5f1d598 100644
--- a/doc/hash/notes.adoc
+++ b/doc/hash/notes.adoc
@@ -10,6 +10,26 @@ https://www.boost.org/LICENSE_1_0.txt
 = Design and Implementation Notes
 :idprefix: notes_
 
+== Quality of the Hash Function
+
+Many hash functions strive to have little correlation between the input and
+output values. They attempt to uniformly distribute the output values for
+very similar inputs. This hash function makes no such attempt. In fact, for
+integers, the result of the hash function is often just the input value. So
+similar but different input values will often result in similar but different
+output values. This means that it is not appropriate as a general hash
+function. For example, a hash table may discard bits from the hash function
+resulting in likely collisions, or might have poor collision resolution when
+hash values are clustered together. In such cases this hash function will
+perform poorly.
+
+But the standard has no such requirement for the hash function; it just
+requires that the hashes of two different values are unlikely to collide.
+Containers or algorithms designed to work with the standard hash function will
+have to be implemented to work well when the hash function's output is
+correlated to its input. Since they are paying that cost, a higher quality hash
+function would be wasteful.
+
 == The hash_value Customization Point
 
 The way one customizes the standard `std::hash` function object for user
@@ -154,22 +174,45 @@ With this improved `hash_combine`, `boost::hash` for strings now passes the
 https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
 (for a 64 bit `size_t`).
 
-== Quality of the Hash Function
+== hash_range
 
-Many hash functions strive to have little correlation between the input and
-output values. They attempt to uniformally distribute the output values for
-very similar inputs. This hash function makes no such attempt. In fact, for
-integers, the result of the hash function is often just the input value. So
-similar but different input values will often result in similar but different
-output values. This means that it is not appropriate as a general hash
-function. For example, a hash table may discard bits from the hash function
-resulting in likely collisions, or might have poor collision resolution when
-hash values are clustered together. In such cases this hash function will
-perform poorly.
+The traditional implementation of `hash_range(seed, first, last)` has been
 
-But the standard has no such requirement for the hash function, it just
-requires that the hashes of two different values are unlikely to collide.
-Containers or algorithms designed to work with the standard hash function will
-have to be implemented to work well when the hash function's output is
-correlated to its input. Since they are paying that cost a higher quality hash
-function would be wasteful.
+[source]
+----
+for( ; first != last; ++first )
+{
+    boost::hash_combine<typename std::iterator_traits<It>::value_type>( seed, *first );
+}
+----
+
+(the explicit template parameter is needed to support iterators with proxy
+return types such as `std::vector<bool>::iterator`.)
+
+This is logical, consistent and straightforward. In the common case where
+`typename std::iterator_traits<It>::value_type` is `char` -- which it is
+in the common case of `boost::hash<std::string>` -- this, however, leaves a
+lot of performance on the table, because processing each `char` individually
+is much less efficient than processing several in bulk.
+
+In Boost 1.81, `hash_range` was changed to process elements of type `char`,
+`signed char`, or `unsigned char`, four at a time. A `uint32_t` is composed
+from `first[0]` to `first[3]`, and that `uint32_t` is fed to `hash_combine`.
+
+In principle, when `size_t` is 64 bit, we could have used `uint64_t` instead.
+We do not, because that would allow producing an arbitrary hash value by choosing
+the input bytes appropriately (because `hash_combine` is reversible.)
+
+Allowing control only over 32 bits of the full 64 bit `size_t` value makes
+these "chosen plaintext attacks" harder.
+
+This is not as harmful to performance as it first appears, because the
+input to `hash<string>` (e.g. the key in an unordered container) is often
+short (9 to 13 bytes in some typical scenarios.)
+
+Note that `hash_range` has also traditionally guaranteed that the same element
+sequence yields the same hash value regardless of the iterator type. This
+property remains valid after the changes to `char` range hashing. `hash_range`,
+applied to the `char` sequence `{ 'a', 'b', 'c' }`, results in the same value
+whether the sequence comes from `char[3]`, `std::string`, `std::deque<char>`,
+or `std::list<char>`.