From f7e537d1a1ac8030e0c00b4a4e2e36abcc88f4e6 Mon Sep 17 00:00:00 2001
From: Peter Dimov
Date: Tue, 20 Sep 2022 21:06:09 +0300
Subject: [PATCH] Update Notes section

---
 doc/hash/notes.adoc | 77 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 17 deletions(-)

diff --git a/doc/hash/notes.adoc b/doc/hash/notes.adoc
index 6cbde88..5f1d598 100644
--- a/doc/hash/notes.adoc
+++ b/doc/hash/notes.adoc
@@ -10,6 +10,26 @@ https://www.boost.org/LICENSE_1_0.txt
 = Design and Implementation Notes
 :idprefix: notes_
 
+== Quality of the Hash Function
+
+Many hash functions strive to have little correlation between the input and
+output values. They attempt to uniformly distribute the output values for
+very similar inputs. This hash function makes no such attempt. In fact, for
+integers, the result of the hash function is often just the input value. So
+similar but different input values will often result in similar but different
+output values. This means that it is not appropriate as a general hash
+function. For example, a hash table may discard bits from the hash function
+resulting in likely collisions, or might have poor collision resolution when
+hash values are clustered together. In such cases this hash function will
+perform poorly.
+
+But the standard has no such requirement for the hash function; it just
+requires that the hashes of two different values are unlikely to collide.
+Containers or algorithms designed to work with the standard hash function will
+have to be implemented to work well when the hash function's output is
+correlated to its input. Since they are paying that cost, a higher quality hash
+function would be wasteful.
+
 == The hash_value Customization Point
 
 The way one customizes the standard `std::hash` function object for user
@@ -154,22 +174,45 @@ With this improved `hash_combine`, `boost::hash` for strings now passes the
 https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
 (for a 64 bit `size_t`).
 
-== Quality of the Hash Function
+== hash_range
 
-Many hash functions strive to have little correlation between the input and
-output values. They attempt to uniformally distribute the output values for
-very similar inputs. This hash function makes no such attempt. In fact, for
-integers, the result of the hash function is often just the input value. So
-similar but different input values will often result in similar but different
-output values. This means that it is not appropriate as a general hash
-function. For example, a hash table may discard bits from the hash function
-resulting in likely collisions, or might have poor collision resolution when
-hash values are clustered together. In such cases this hash function will
-perform poorly.
+The traditional implementation of `hash_range(seed, first, last)` has been
 
-But the standard has no such requirement for the hash function, it just
-requires that the hashes of two different values are unlikely to collide.
-Containers or algorithms designed to work with the standard hash function will
-have to be implemented to work well when the hash function's output is
-correlated to its input. Since they are paying that cost a higher quality hash
-function would be wasteful.
+[source]
+----
+for( ; first != last; ++first )
+{
+    boost::hash_combine<typename std::iterator_traits<It>::value_type>( seed, *first );
+}
+----
+
+(the explicit template parameter is needed to support iterators with proxy
+return types such as `std::vector<bool>::iterator`.)
+
+This is logical, consistent and straightforward. In the common case where
+`typename std::iterator_traits<It>::value_type` is `char` -- which it is
+in the common case of `boost::hash<std::string>` -- this however leaves a
+lot of performance on the table, because processing each `char` individually
+is much less efficient than processing several in bulk.
+
+In Boost 1.81, `hash_range` was changed to process elements of type `char`,
+`signed char`, or `unsigned char`, four at a time. A `uint32_t` is composed
+from `first[0]` to `first[3]`, and that `uint32_t` is fed to `hash_combine`.
+
+In principle, when `size_t` is 64 bit, we could have used `uint64_t` instead.
+We do not, because that would allow producing an arbitrary hash value by choosing
+the input bytes appropriately (because `hash_combine` is reversible.)
+
+Allowing control only over 32 bits of the full 64 bit `size_t` value makes
+these "chosen plaintext attacks" harder.
+
+This is not as harmful to performance as it first appears, because the
+input to `hash<std::string>` (e.g. the key in an unordered container) is often
+short (9 to 13 bytes in some typical scenarios.)
+
+Note that `hash_range` has also traditionally guaranteed that the same element
+sequence yields the same hash value regardless of the iterator type. This
+property remains valid after the changes to `char` range hashing. `hash_range`,
+applied to the `char` sequence `{ 'a', 'b', 'c' }`, results in the same value
+whether the sequence comes from `char[3]`, `std::string`, `std::deque<char>`,
+or `std::list<char>`.