Update Notes section

This commit is contained in:
Peter Dimov
2022-09-20 21:06:09 +03:00
parent 607b73f1e0
commit f7e537d1a1

View File

@@ -10,6 +10,26 @@ https://www.boost.org/LICENSE_1_0.txt
= Design and Implementation Notes
:idprefix: notes_
== Quality of the Hash Function
Many hash functions strive to have little correlation between the input and
output values. They attempt to uniformally distribute the output values for
very similar inputs. This hash function makes no such attempt. In fact, for
integers, the result of the hash function is often just the input value. So
similar but different input values will often result in similar but different
output values. This means that it is not appropriate as a general hash
function. For example, a hash table may discard bits from the hash function
resulting in likely collisions, or might have poor collision resolution when
hash values are clustered together. In such cases this hash function will
perform poorly.
But the standard has no such requirement for the hash function, it just
requires that the hashes of two different values are unlikely to collide.
Containers or algorithms designed to work with the standard hash function will
have to be implemented to work well when the hash function's output is
correlated to its input. Since they are paying that cost a higher quality hash
function would be wasteful.
== The hash_value Customization Point
The way one customizes the standard `std::hash` function object for user
@@ -154,22 +174,45 @@ With this improved `hash_combine`, `boost::hash` for strings now passes the
https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
(for a 64 bit `size_t`).
== Quality of the Hash Function
== hash_range
Many hash functions strive to have little correlation between the input and
output values. They attempt to uniformally distribute the output values for
very similar inputs. This hash function makes no such attempt. In fact, for
integers, the result of the hash function is often just the input value. So
similar but different input values will often result in similar but different
output values. This means that it is not appropriate as a general hash
function. For example, a hash table may discard bits from the hash function
resulting in likely collisions, or might have poor collision resolution when
hash values are clustered together. In such cases this hash function will
perform poorly.
The traditional implementation of `hash_range(seed, first, last)` has been
But the standard has no such requirement for the hash function, it just
requires that the hashes of two different values are unlikely to collide.
Containers or algorithms designed to work with the standard hash function will
have to be implemented to work well when the hash function's output is
correlated to its input. Since they are paying that cost a higher quality hash
function would be wasteful.
[source]
----
for( ; first != last; ++first )
{
boost::hash_combine<typename std::iterator_traits<It>::value_type>( seed, *first );
}
----
(the explicit template parameter is needed to support iterators with proxy
return types such as `std::vector<bool>::iterator`.)
This is logical, consistent and straightforward. In the common case where
`typename std::iterator_traits<It>::value_type` is `char` -- which it is
in the common case of `boost::hash<std::string>` -- this however leaves a
lot of performance on the table, because processing each `char` individually
is much less efficient than processing several in bulk.
In Boost 1.81, `hash_range` was changed to process elements of type `char`,
`signed char`, or `unsigned char`, four of a time. A `uint32_t` is composed
from `first[0]` to `first[3]`, and that `uint32_t` is fed to `hash_combine`.
In principle, when `size_t` is 64 bit, we could have used `uint64_t` instead.
We do not, because this allows producing an arbitrary hash value by choosing
the input bytes appropriately (because `hash_combine` is reversible.)
Allowing control only over 32 bits of the full 64 bit `size_t` value makes
these "chosen plaintext attacks" harder.
This is not as harmful to performance as it first appears, because the
input to `hash<string>` (e.g. the key in an unordered container) is often
short (9 to 13 bytes in some typical scenarios.)
Note that `hash_range` has also traditionally guaranteed that the same element
sequence yields the same hash value regardless of the iterator type. This
property remains valid after the changes to `char` range hashing. `hash_range`,
applied to the `char` sequence `{ 'a', 'b', 'c' }`, results in the same value
whether the sequence comes from `char[3]`, `std::string`, `std::deque<char>`,
or `std::list<char>`.