Update Notes section

2022-09-20 21:06:09 +03:00
parent 607b73f1e0
commit f7e537d1a1
1 changed files with 60 additions and 17 deletions
--- a/doc/hash/notes.adoc
+++ b/doc/hash/notes.adoc
@@ -10,6 +10,26 @@ https://www.boost.org/LICENSE_1_0.txt
 = Design and Implementation Notes
 :idprefix: notes_
 == Quality of the Hash Function
 Many hash functions strive to have little correlation between the input and
 output values. They attempt to uniformly distribute the output values for
 very similar inputs. This hash function makes no such attempt. In fact, for
 integers, the result of the hash function is often just the input value. So
 similar but different input values will often result in similar but different
 output values. This means that it is not appropriate as a general hash
 function. For example, a hash table may discard bits from the hash function
 resulting in likely collisions, or might have poor collision resolution when
 hash values are clustered together. In such cases this hash function will
 perform poorly.
 But the standard has no such requirement for the hash function; it only
 requires that the hashes of two different values are unlikely to collide.
 Containers or algorithms designed to work with the standard hash function will
 have to be implemented to work well when the hash function's output is
 correlated to its input. Since they are paying that cost, a higher quality hash
 function would be wasteful.
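The effect of discarding bits can be sketched in a few lines. The names `identity_hash`, `bucket_of` and `distinct_buckets` below are hypothetical and not part of Boost; the sketch only assumes a hash that returns its integer input unchanged and a table that keeps the low bits of the hash.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in for a hash that returns its integer input unchanged.
inline std::size_t identity_hash( std::size_t v )
{
    return v;
}

// Hypothetical table policy: keep only the low bits of the hash value.
inline std::size_t bucket_of( std::size_t hash, std::size_t bucket_count )
{
    // The masking trick below requires a power-of-two bucket count.
    assert( bucket_count != 0 && ( bucket_count & ( bucket_count - 1 ) ) == 0 );
    return hash & ( bucket_count - 1 );
}

// Counts how many distinct buckets the keys 0, step, 2*step, ... occupy.
inline std::size_t distinct_buckets( std::size_t step, std::size_t n, std::size_t bucket_count )
{
    std::set<std::size_t> used;

    for( std::size_t i = 0; i < n; ++i )
    {
        used.insert( bucket_of( identity_hash( i * step ), bucket_count ) );
    }

    return used.size();
}
```

With 16 buckets, keys that are multiples of 16 all land in a single bucket, whereas consecutive keys spread over all 16. A container built for such a hash function has to compensate for this itself.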
 == The hash_value Customization Point
 The way one customizes the standard `std::hash` function object for user
@@ -154,22 +174,45 @@ With this improved `hash_combine`, `boost::hash` for strings now passes the
 https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
 (for a 64 bit `size_t`).
-== Quality of the Hash Function
-Many hash functions strive to have little correlation between the input and
-output values. They attempt to uniformly distribute the output values for
-very similar inputs. This hash function makes no such attempt. In fact, for
-integers, the result of the hash function is often just the input value. So
-similar but different input values will often result in similar but different
-output values. This means that it is not appropriate as a general hash
-function. For example, a hash table may discard bits from the hash function
-resulting in likely collisions, or might have poor collision resolution when
-hash values are clustered together. In such cases this hash function will
-perform poorly.
-But the standard has no such requirement for the hash function; it only
-requires that the hashes of two different values are unlikely to collide.
-Containers or algorithms designed to work with the standard hash function will
-have to be implemented to work well when the hash function's output is
-correlated to its input. Since they are paying that cost, a higher quality hash
-function would be wasteful.
+== hash_range
+The traditional implementation of `hash_range(seed, first, last)` has been
+[source]
+----
+for( ; first != last; ++first )
+{
+    boost::hash_combine<typename std::iterator_traits<It>::value_type>( seed, *first );
+}
+----
 (The explicit template parameter is needed to support iterators with proxy
 return types, such as `std::vector<bool>::iterator`.)
 This is logical, consistent and straightforward. However, when
 `typename std::iterator_traits<It>::value_type` is `char` -- as it is
 in the common case of `boost::hash<std::string>` -- it leaves a
 lot of performance on the table, because processing each `char` individually
 is much less efficient than processing several in bulk.
 In Boost 1.81, `hash_range` was changed to process elements of type `char`,
 `signed char`, or `unsigned char`, four at a time. A `uint32_t` is composed
 from `first[0]` to `first[3]`, and that `uint32_t` is fed to `hash_combine`.
 In principle, when `size_t` is 64 bit, we could have used `uint64_t` instead.
 We do not, because doing so would allow an attacker to produce an arbitrary
 hash value by choosing the input bytes appropriately (`hash_combine` is
 reversible). Allowing control over only 32 bits of the full 64 bit `size_t`
 value makes these "chosen plaintext attacks" harder.
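To illustrate why reversibility matters, here is a minimal sketch with a toy combining step. It is not Boost's `hash_combine`; `toy_combine`, `toy_multiplier`, `inverse_mod_2_64` and `forge_input` are hypothetical names, and the multiplier is an arbitrary odd constant chosen for the example. The point is only that, when a full 64 bit input word can be chosen freely, any target value can be reached from any seed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical odd multiplier; every odd constant is invertible modulo 2^64.
constexpr std::uint64_t toy_multiplier = 0x9e3779b97f4a7c15ull;

// Toy reversible combining step; NOT Boost's hash_combine.
inline std::uint64_t toy_combine( std::uint64_t seed, std::uint64_t v )
{
    return ( seed + v ) * toy_multiplier;
}

// Multiplicative inverse of an odd number modulo 2^64, by Newton's iteration.
inline std::uint64_t inverse_mod_2_64( std::uint64_t m )
{
    assert( m % 2 == 1 );

    std::uint64_t inv = m; // correct to 3 bits, since m * m == 1 (mod 8)

    for( int i = 0; i < 5; ++i )
    {
        inv *= 2 - m * inv; // each step doubles the number of correct bits
    }

    return inv;
}

// Solves toy_combine( seed, v ) == target for v.
inline std::uint64_t forge_input( std::uint64_t seed, std::uint64_t target )
{
    return target * inverse_mod_2_64( toy_multiplier ) - seed;
}
```

In this sketch the forged word is an essentially arbitrary 64 bit number, so it will usually not fit into 32 bits. Restricting the input word to a `uint32_t` therefore takes away the one-step solution.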
 Using 32 bit chunks is not as harmful to performance as it first appears,
 because the input to `hash<string>` (e.g. the key in an unordered container)
 is often short (9 to 13 bytes in some typical scenarios).
 Note that `hash_range` has also traditionally guaranteed that the same element
 sequence yields the same hash value regardless of the iterator type. This
 property remains valid after the changes to `char` range hashing. `hash_range`,
 applied to the `char` sequence `{ 'a', 'b', 'c' }`, results in the same value
 whether the sequence comes from `char[3]`, `std::string`, `std::deque<char>`,
 or `std::list<char>`.
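The four-at-a-time scheme and the iterator independence described above can be sketched as follows. This is not the Boost implementation: `toy_combine`, `chunked_hash_range` and `same_for_all_containers` are hypothetical names, the mixing step is a toy, and the byte order, the zero padding of a short final chunk and the trailing length step are assumptions made for the example. Because the sketch only increments and dereferences the iterators, the result depends on the element sequence and not on the container.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <list>
#include <string>

// Toy stand-in for hash_combine; NOT Boost's actual mixing function.
inline void toy_combine( std::size_t& seed, std::uint32_t v )
{
    seed ^= v + 0x9e3779b9u + ( seed << 6 ) + ( seed >> 2 );
}

// Hypothetical sketch: consume a char range four elements at a time.
template<class It>
std::size_t chunked_hash_range( std::size_t seed, It first, It last )
{
    std::size_t length = 0;

    while( first != last )
    {
        std::uint32_t word = 0;

        // Assumption: little-endian packing; a short final chunk is zero-padded.
        for( int i = 0; i < 4 && first != last; ++i, ++first, ++length )
        {
            std::uint32_t byte = static_cast<unsigned char>( *first );
            word |= byte << ( 8 * i );
        }

        toy_combine( seed, word );
    }

    // Assumption: mix in the length so that zero padding stays unambiguous.
    toy_combine( seed, static_cast<std::uint32_t>( length ) );
    return seed;
}

// Hashes the same characters through three different iterator types.
inline bool same_for_all_containers( std::string const& s )
{
    std::deque<char> d( s.begin(), s.end() );
    std::list<char> l( s.begin(), s.end() );

    std::size_t const h = chunked_hash_range( 0, s.begin(), s.end() );

    return h == chunked_hash_range( 0, d.begin(), d.end() )
        && h == chunked_hash_range( 0, l.begin(), l.end() );
}
```

An implementation that read four bytes at once through a pointer would be faster for contiguous storage, but it would then need a separate path for `std::list` and `std::deque` that produces identical values.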