forked from boostorg/container_hash
Update Notes section
@@ -10,6 +10,26 @@ https://www.boost.org/LICENSE_1_0.txt
|
|||||||
= Design and Implementation Notes
|
= Design and Implementation Notes
|
||||||
:idprefix: notes_
|
:idprefix: notes_
|
||||||
|
|
||||||
|
== Quality of the Hash Function
|
||||||
|
|
||||||
|
Many hash functions strive to have little correlation between the input and
|
||||||
|
output values. They attempt to uniformly distribute the output values for
|
||||||
|
very similar inputs. This hash function makes no such attempt. In fact, for
|
||||||
|
integers, the result of the hash function is often just the input value. So
|
||||||
|
similar but different input values will often result in similar but different
|
||||||
|
output values. This means that it is not appropriate as a general hash
|
||||||
|
function. For example, a hash table may discard bits from the hash function
|
||||||
|
resulting in likely collisions, or might have poor collision resolution when
|
||||||
|
hash values are clustered together. In such cases this hash function will
|
||||||
|
perform poorly.
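
To make the bit-discarding case concrete, here is a minimal sketch. Both `identity_hash` and `bucket_of` are hypothetical names, and masking with a power-of-two bucket count is an assumption about how such a table might pick a bucket, not a description of any particular container.

```cpp
#include <cstddef>

// Assumption: integers hash to themselves, as the text says is often the case.
inline std::size_t identity_hash( std::size_t v )
{
    return v;
}

// Hypothetical table that keeps only the low bits of the hash value.
// bucket_count must be a power of two for the mask to be valid.
inline std::size_t bucket_of( std::size_t h, std::size_t bucket_count )
{
    return h & ( bucket_count - 1 );
}
```

With 256 buckets, the keys 256, 512 and 768 all map to bucket 0 under this scheme, because they differ only in the bits that the mask throws away.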
|
||||||
|
|
||||||
|
But the standard has no such requirement for the hash function; it just
|
||||||
|
requires that the hashes of two different values are unlikely to collide.
|
||||||
|
Containers or algorithms designed to work with the standard hash function will
|
||||||
|
have to be implemented to work well when the hash function's output is
|
||||||
|
correlated to its input. Since they are paying that cost, a higher quality hash
|
||||||
|
function would be wasteful.
|
||||||
|
|
||||||
== The hash_value Customization Point
|
== The hash_value Customization Point
|
||||||
|
|
||||||
The way one customizes the standard `std::hash` function object for user
|
The way one customizes the standard `std::hash` function object for user
|
||||||
@@ -154,22 +174,45 @@ With this improved `hash_combine`, `boost::hash` for strings now passes the
|
|||||||
https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
|
https://github.com/aappleby/smhasher[SMHasher test suite] by Austin Appleby
|
||||||
(for a 64 bit `size_t`).
|
(for a 64 bit `size_t`).
|
||||||
|
|
||||||
== Quality of the Hash Function
|
== hash_range
|
||||||
|
|
||||||
Many hash functions strive to have little correlation between the input and
|
The traditional implementation of `hash_range(seed, first, last)` has been
|
||||||
output values. They attempt to uniformly distribute the output values for
|
|
||||||
very similar inputs. This hash function makes no such attempt. In fact, for
|
|
||||||
integers, the result of the hash function is often just the input value. So
|
|
||||||
similar but different input values will often result in similar but different
|
|
||||||
output values. This means that it is not appropriate as a general hash
|
|
||||||
function. For example, a hash table may discard bits from the hash function
|
|
||||||
resulting in likely collisions, or might have poor collision resolution when
|
|
||||||
hash values are clustered together. In such cases this hash function will
|
|
||||||
perform poorly.
|
|
||||||
|
|
||||||
But the standard has no such requirement for the hash function; it just
|
[source]
|
||||||
requires that the hashes of two different values are unlikely to collide.
|
----
|
||||||
Containers or algorithms designed to work with the standard hash function will
|
for( ; first != last; ++first )
|
||||||
have to be implemented to work well when the hash function's output is
|
{
|
||||||
correlated to its input. Since they are paying that cost, a higher quality hash
|
boost::hash_combine<typename std::iterator_traits<It>::value_type>( seed, *first );
|
||||||
function would be wasteful.
|
}
|
||||||
|
----
|
||||||
|
|
||||||
|
(The explicit template parameter is needed to support iterators with proxy
|
||||||
|
return types such as `std::vector<bool>::iterator`.)
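
The proxy issue can be seen in isolation with a small sketch. `deduced_as_bool` is a hypothetical helper that only reports which type its template parameter ended up as; it stands in for a function template such as `hash_combine` and is not part of the library.

```cpp
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

using BoolIt = std::vector<bool>::iterator;

// The iterator's value_type is bool...
static_assert( std::is_same<std::iterator_traits<BoolIt>::value_type, bool>::value,
    "value_type of vector<bool>::iterator is bool" );

// ...but dereferencing it yields a proxy object, not a bool.
static_assert( !std::is_same<std::decay<decltype( *std::declval<BoolIt&>() )>::type, bool>::value,
    "operator* of vector<bool>::iterator does not return bool" );

// Hypothetical stand-in for a function template like hash_combine:
// it reports whether T ended up being bool.
template<class T> bool deduced_as_bool( T const& )
{
    return std::is_same<T, bool>::value;
}
```

Calling `deduced_as_bool( *it )` on a `std::vector<bool>` iterator deduces the proxy type, while `deduced_as_bool<bool>( *it )` converts the proxy to `bool` first. Passing `value_type` explicitly has the second effect.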
|
||||||
|
|
||||||
|
This is logical, consistent and straightforward. In the common case where
|
||||||
|
`typename std::iterator_traits<It>::value_type` is `char` -- which it is
|
||||||
|
in the common case of `boost::hash<std::string>` -- this however leaves a
|
||||||
|
lot of performance on the table, because processing each `char` individually
|
||||||
|
is much less efficient than processing several in bulk.
|
||||||
|
|
||||||
|
In Boost 1.81, `hash_range` was changed to process elements of type `char`,
|
||||||
|
`signed char`, or `unsigned char`, four at a time. A `uint32_t` is composed
|
||||||
|
from `first[0]` to `first[3]`, and that `uint32_t` is fed to `hash_combine`.
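
The packing step might look like the following sketch. `compose_u32` is a hypothetical name, and the byte order (with `first[0]` in the lowest byte) is an assumption, since the text above does not specify it.

```cpp
#include <cstdint>

// Hypothetical helper: packs four consecutive bytes into one 32-bit word.
// Placing first[0] in the low byte is an assumption made for illustration.
template<class It>
std::uint32_t compose_u32( It first )
{
    std::uint32_t w = 0;

    for( int i = 0; i < 4; ++i, ++first )
    {
        // Go through unsigned char so that negative char values
        // do not sign-extend into the upper bits.
        w |= static_cast<std::uint32_t>( static_cast<unsigned char>( *first ) ) << ( 8 * i );
    }

    return w;
}
```

Under this assumed byte order, the bytes of `"abcd"` pack to `0x64636261`. The sketch covers only a full group of four; how a shorter final group is handled is not shown here.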
|
||||||
|
|
||||||
|
In principle, when `size_t` is 64 bit, we could have used `uint64_t` instead.
|
||||||
|
We do not, because that would allow producing an arbitrary hash value by choosing
|
||||||
|
the input bytes appropriately (because `hash_combine` is reversible.)
|
||||||
|
|
||||||
|
Allowing control only over 32 bits of the full 64 bit `size_t` value makes
|
||||||
|
these "chosen plaintext attacks" harder.
|
||||||
|
|
||||||
|
This is not as harmful to performance as it first appears, because the
|
||||||
|
input to `hash<string>` (e.g. the key in an unordered container) is often
|
||||||
|
short (9 to 13 bytes in some typical scenarios.)
|
||||||
|
|
||||||
|
Note that `hash_range` has also traditionally guaranteed that the same element
|
||||||
|
sequence yields the same hash value regardless of the iterator type. This
|
||||||
|
property remains valid after the changes to `char` range hashing. `hash_range`,
|
||||||
|
applied to the `char` sequence `{ 'a', 'b', 'c' }`, results in the same value
|
||||||
|
whether the sequence comes from `char[3]`, `std::string`, `std::deque<char>`,
|
||||||
|
or `std::list<char>`.
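
One way to see why chunked hashing can stay independent of the iterator type is to read the input only through `*first` and `++first`. The sketch below does that. `toy_combine`, `toy_hash_char_range` and `same_across_containers` are hypothetical names, and the mixing formula, byte order, and zero-padding of a short final chunk are all assumptions for illustration, not the library's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <list>
#include <string>

// Stand-in for boost::hash_combine; this mixing formula is illustrative only.
inline void toy_combine( std::size_t& seed, std::uint32_t v )
{
    seed ^= v + 0x9e3779b9u + ( seed << 6 ) + ( seed >> 2 );
}

// Hypothetical chunked range hash. It touches the input only through
// *first and ++first, so the container behind the iterator does not matter.
template<class It>
std::size_t toy_hash_char_range( It first, It last )
{
    std::size_t seed = 0;

    while( first != last )
    {
        std::uint32_t w = 0;

        // Pack up to four bytes, first byte lowest (assumed order).
        // A short final chunk is zero-padded, which is also an assumption.
        for( int i = 0; i < 4 && first != last; ++i, ++first )
        {
            w |= static_cast<std::uint32_t>( static_cast<unsigned char>( *first ) ) << ( 8 * i );
        }

        toy_combine( seed, w );
    }

    return seed;
}

// Hashes the sequence { 'a', 'b', 'c' } from four different sources
// and reports whether all four results agree.
inline bool same_across_containers()
{
    char const a[ 3 ] = { 'a', 'b', 'c' };
    std::string s( "abc" );
    std::deque<char> d( s.begin(), s.end() );
    std::list<char> l( s.begin(), s.end() );

    std::size_t h = toy_hash_char_range( a, a + 3 );

    return h == toy_hash_char_range( s.begin(), s.end() )
        && h == toy_hash_char_range( d.begin(), d.end() )
        && h == toy_hash_char_range( l.begin(), l.end() );
}
```

Because of the zero-padding, this sketch hashes `"abc"` and `"abc\0"` identically, so it is a demonstration of the iterator-independence property only and not a usable hash function.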
|
||||||
|