[#buckets]
:idprefix: buckets_
:imagesdir: ../diagrams
= The Data Structure
The containers are made up of a number of 'buckets', each of which can contain
any number of elements. For example, the following diagram shows an <<unordered_set,unordered_set>> with 7 buckets containing 5 elements, `A`,
`B`, `C`, `D` and `E` (this is just for illustration, containers will typically
have more buckets).
image::buckets.png[]
In order to decide which bucket to place an element in, the container applies
the hash function, `Hash`, to the element's key (for `unordered_set` and
`unordered_multiset` the key is the whole element, but is referred to as the key
so that the same terminology can be used for sets and maps). This returns a
value of type `std::size_t`. `std::size_t` has a much greater range of values
than the number of buckets, so the container applies another transformation to
that value to choose a bucket to place the element in.
that value to choose a bucket to place the element in.
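
For illustration, here is a minimal sketch of one common transformation; the actual mapping is implementation-defined, and `hash_to_bucket` is just a hypothetical name:

```c++
#include <cstddef>

// Illustrative sketch only: the hash-to-bucket transformation is
// implementation-defined. A common choice is modulo reduction.
std::size_t hash_to_bucket(std::size_t hash_value, std::size_t bucket_count)
{
    return hash_value % bucket_count;
}
```
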
Retrieving the elements for a given key is simple. The same process is applied
to the key to find the correct bucket. Then the key is compared with the
elements in the bucket to find any elements that match (using the equality
predicate `Pred`). If the hash function has worked well the elements will be
evenly distributed amongst the buckets so only a small number of elements will
need to be examined.
There is <<hash_equality, more information on hash functions and
equality predicates in the next section>>.
You can see in the diagram that `A` & `D` have been placed in the same bucket.
This is known as a collision. When looking for elements in this bucket, up to
2 comparisons are made, making the search slower. To keep things fast, we try
to keep collisions to a minimum.
[caption=, title='Table {counter:table-counter}. Methods for Accessing Buckets']
[cols="1,.^1", frame=all, grid=rows]
|===
|Method |Description
|`size_type bucket_count() const`
|The number of buckets.
|`size_type max_bucket_count() const`
|An upper bound on the number of buckets.
|`size_type bucket_size(size_type n) const`
|The number of elements in bucket `n`.
|`size_type bucket(key_type const& k) const`
|Returns the index of the bucket which would contain `k`.
|`local_iterator begin(size_type n)`
1.6+|Return begin and end iterators for bucket `n`.
|`local_iterator end(size_type n)`
|`const_local_iterator begin(size_type n) const`
|`const_local_iterator end(size_type n) const`
|`const_local_iterator cbegin(size_type n) const`
|`const_local_iterator cend(size_type n) const`
|===
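
For example, the methods above can be used to inspect how elements are distributed; the exact output depends on the container's bucket count and hash function, so this is only a sketch of typical usage:

```c++
#include <boost/unordered_set.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    boost::unordered_set<int> s{1, 2, 3, 4, 5};

    // Which bucket would contain the key 3?
    std::size_t n = s.bucket(3);

    // Walk that bucket's elements with the local iterator interface.
    std::cout << "bucket " << n << " of " << s.bucket_count()
              << " holds " << s.bucket_size(n) << " element(s):\n";
    for (auto it = s.begin(n); it != s.end(n); ++it) {
        std::cout << "  " << *it << '\n';
    }
}
```
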
== Controlling the number of buckets
As more elements are added to an unordered associative container, the number
of elements in the buckets will increase causing performance to degrade.
To combat this, the containers increase the bucket count as elements are inserted.
You can also tell the container to change the bucket count (if required) by
calling `rehash`.
The standard leaves a lot of freedom to the implementer to decide how the
number of buckets is chosen, but it does make some requirements based on the
container's 'load factor', the average number of elements per bucket.
Containers also have a 'maximum load factor' which they should try to keep the
load factor below.
You can't control the bucket count directly but there are two ways to
influence it:
* Specify the minimum number of buckets when constructing a container or when calling `rehash`.
* Suggest a maximum load factor by calling `max_load_factor`.
`max_load_factor` doesn't let you set the maximum load factor yourself; it just
lets you give a _hint_. And even then, the standard doesn't actually
require the container to pay much attention to this value. The only time the
load factor is _required_ to be less than the maximum is following a call to
`rehash`. But most implementations will try to keep the load factor below the
maximum, and will set the maximum load factor to be the same as or close to
the hint - unless your hint is unreasonably small or large.
[caption=, title='Table {counter:table-counter}. Methods for Controlling Bucket Size']
[cols="1,.^1", frame=all, grid=rows]
|===
|Method |Description
|`X(size_type n)`
|Construct an empty container with at least `n` buckets (`X` is the container type).
|`X(InputIterator i, InputIterator j, size_type n)`
|Construct a container with at least `n` buckets and insert elements from the range `[i, j)` (`X` is the container type).
|`float load_factor() const`
|The average number of elements per bucket.
|`float max_load_factor() const`
|Returns the current maximum load factor.
|`void max_load_factor(float z)`
|Changes the container's maximum load factor, using `z` as a hint.
|`void rehash(size_type n)`
|Changes the number of buckets so that there are at least `n` buckets, and so that the load factor is less than the maximum load factor.
|===
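
Putting these together, a typical usage might look like the following; this is only an illustration, and the values printed depend on the implementation:

```c++
#include <boost/unordered_map.hpp>
#include <iostream>
#include <string>

int main()
{
    // Ask for at least 1024 buckets up front.
    boost::unordered_map<int, std::string> m(1024);

    // Suggest a maximum load factor; the container treats this
    // value as a hint and may adjust it.
    m.max_load_factor(0.5f);

    m[1] = "one";

    std::cout << "buckets:         " << m.bucket_count()    << '\n';
    std::cout << "load factor:     " << m.load_factor()     << '\n';
    std::cout << "max load factor: " << m.max_load_factor() << '\n';

    // Ensure at least 4096 buckets, keeping the load factor
    // below the maximum.
    m.rehash(4096);
}
```
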
== Iterator Invalidation
It is not specified how member functions other than `rehash` and `reserve` affect
the bucket count, although `insert` is only allowed to invalidate iterators
when the insertion causes the load factor to be greater than or equal to the
maximum load factor. For most implementations this means that `insert` will only
change the number of buckets when this happens. While iterators can be
invalidated by calls to `insert`, `rehash` and `reserve`, pointers and references to the
container's elements are never invalidated.
In a similar manner to using `reserve` for ``vector``s, it can be a good idea
to call `reserve` before inserting a large number of elements. This will get
the expensive rehashing out of the way and let you store iterators, safe in
the knowledge that they won't be invalidated. If you are inserting `n`
elements into container `x`, you could first call:
```
x.reserve(n);
```
Note:: `reserve(n)` reserves space for at least `n` elements, allocating enough buckets
so as to not exceed the maximum load factor.
+
Because the load factor is defined as the number of elements divided by the total
number of available buckets, this function is logically equivalent to:
+
```
x.rehash(std::ceil(n / x.max_load_factor()));
```
+
See the <<unordered_map_rehash,reference for more details>> on the `rehash` function.
== Fast Closed Addressing Implementation
++++
<style>
.imageblock > .title {
text-align: inherit;
}
</style>
++++
Boost.Unordered sports one of the fastest implementations of closed addressing, also commonly known as https://en.wikipedia.org/wiki/Hash_table#Separate_chaining[separate chaining]. An example figure representing the data structure is below:
[#img-bucket-groups,.text-center]
.A simple bucket group approach
image::bucket-groups.png[align=center]
An array of "buckets" is allocated and each bucket in turn points to its own individual linked list. This makes meeting the standard requirements of bucket iteration straightforward. Unfortunately, iterating the entire container is often slow with this layout, as each bucket must be examined for occupancy, yielding a time complexity of `O(bucket_count() + size())` when the standard requires complexity to be `O(size())`.
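
To see where the `O(bucket_count() + size())` term comes from, consider this sketch of locating the first element when each bucket has its own list; the types and names here are hypothetical, not the library's actual internals:

```c++
#include <cstddef>

// Hypothetical sketch: with one linked list per bucket, finding the
// first (or next) element means scanning forward for the next
// non-empty bucket. Over a full traversal every bucket is examined,
// hence the bucket_count() term.
struct node { node* next = nullptr; /* value lives here */ };

node* first_occupied(node* const* buckets, std::size_t bucket_count)
{
    for (std::size_t i = 0; i < bucket_count; ++i) {
        if (buckets[i]) return buckets[i]; // may skip many empty buckets
    }
    return nullptr;
}
```
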
Canonical standard implementations will wind up looking like the diagram below:
[.text-center]
.The canonical standard approach
image::singly-linked.png[align=center,link=../diagrams/singly-linked.png,window=_blank]
It's worth noting that this approach is only used by pass:[libc++] and pass:[libstdc++]; the MSVC Dinkumware implementation uses a different one. A more detailed analysis of the standard containers can be found http://bannalia.blogspot.com/2013/10/implementation-of-c-unordered.html[here].
This unusual layout is chosen to make iteration of the entire container efficient, by interconnecting all of the nodes into a singly-linked list. One might also notice that buckets point to the node _before_ the start of the bucket's elements. This is done so that elements can be removed from the list efficiently, without the need for a doubly-linked list. Unfortunately, this data structure introduces a guaranteed extra indirection. For example, to access the first element of a bucket, something like this must be done:
```c++
auto const idx = get_bucket_idx(hash_function(key));
node* p = buckets[idx]; // first load
node* n = p->next; // second load
if (n && is_in_bucket(n, idx)) {
  value_type const& v = *n; // third load
  // ...
}
```
With a simple bucket group layout, this is all that must be done:
```c++
auto const idx = get_bucket_idx(hash_function(key));
node* n = buckets[idx]; // first load
if (n) {
  value_type const& v = *n; // second load
  // ...
}
```
In practice, the extra indirection can have a dramatic performance impact on common operations such as `insert`, `find` and `erase`. To keep iteration of the container fast, Boost.Unordered introduces a novel data structure, a "bucket group". A bucket group is a fixed-width view of a subsection of the buckets array. It contains a bitmask (a `std::size_t`) used to track the occupancy of its buckets, as well as two pointers so that it can form a doubly-linked list with other non-empty groups. An example diagram is below:
[#img-fca-layout]
.The new layout used by Boost
image::fca.png[align=center]
Thus container-wide iteration is turned into a traversal of the non-empty bucket groups (each step of which has constant time complexity), which reduces the overall time complexity back to `O(size())`. In total, a bucket group is only 4 words in size and it views `sizeof(std::size_t) * CHAR_BIT` buckets, meaning that for all common implementations there are only 4 bits of space overhead per bucket introduced by the bucket groups.
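
A sketch of the idea is below; the field and function names are illustrative assumptions, not Boost.Unordered's actual definitions. It uses C++20's `std::countr_zero` to find the lowest set bit of the occupancy mask:

```c++
#include <bit>     // std::countr_zero (C++20)
#include <climits>
#include <cstddef>

struct bucket; // per-bucket list head, elided here

// Illustrative sketch of a bucket group: 4 words viewing
// sizeof(std::size_t) * CHAR_BIT buckets.
struct bucket_group
{
    static constexpr std::size_t N = sizeof(std::size_t) * CHAR_BIT;

    bucket*       buckets; // first bucket viewed by this group
    std::size_t   bitmask; // bit i set <=> buckets[i] is occupied
    bucket_group* next;    // doubly-linked list threaded through
    bucket_group* prev;    //   the non-empty groups only
};

// Find the first occupied bucket at or after position pos (pos < N)
// in group g, falling through to later non-empty groups as needed.
bucket* next_occupied(bucket_group* g, std::size_t pos)
{
    std::size_t mask = g->bitmask & (~std::size_t(0) << pos);
    while (mask == 0) {
        g = g->next;            // skip whole runs of empty buckets
        if (!g) return nullptr;
        mask = g->bitmask;
    }
    return &g->buckets[std::countr_zero(mask)]; // lowest set bit
}
```
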
For more information on implementation rationale, read the <<Implementation Rationale, corresponding section>>.