uploaded current status

joaquintides
2022-10-30 19:16:43 +01:00
parent 90f2f0f67d
commit 2068cf8d5b
7 changed files with 196 additions and 43 deletions

@ -5,7 +5,7 @@
= The Data Structure
The containers are made up of a number of 'buckets', each of which can contain
any number of elements. For example, the following diagram shows an <<unordered_set,unordered_set>> with 7 buckets containing 5 elements, `A`,
any number of elements. For example, the following diagram shows a <<unordered_set,`boost::unordered_set`>> with 7 buckets containing 5 elements, `A`,
`B`, `C`, `D` and `E` (this is just for illustration, containers will typically
have more buckets).
@ -31,20 +31,34 @@ equality predicates in the next section>>.
You can see in the diagram that `A` & `D` have been placed in the same bucket.
When looking for elements in this bucket up to 2 comparisons are made, making
the search slower. This is known as a collision. To keep things fast we try to
the search slower. This is known as a *collision*. To keep things fast we try to
keep collisions to a minimum.
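As a rough illustration of the collision idea, the sketch below counts how many elements share a bucket with at least one other element. It uses `std::unordered_set` as a stand-in, on the assumption that `boost::unordered_set` exposes the same `bucket_count` and `bucket_size` members; `colliding_elements` is a hypothetical helper name, not part of either library.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical helper: number of elements that live in a bucket holding
// more than one element, i.e. elements involved in a collision.
template <class Set>
std::size_t colliding_elements(const Set& s) {
    std::size_t n = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        if (s.bucket_size(b) > 1) n += s.bucket_size(b);
    }
    assert(n <= s.size()); // sanity check: cannot exceed the element count
    return n;
}
```

The result depends on the hash function and on the current bucket count, so it is only meaningful for the container state at the time of the call.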
If instead of `boost::unordered_set` we had used <<unordered_flat_set,`boost::unordered_flat_set`>>, the
diagram would look as follows:
image::buckets oa.png[]
In open-addressing containers, buckets can hold at most one element; if a collision happens
(as is the case of `D` in the example), the element uses some other available bucket in
the vicinity of the original position. Given this simpler scenario, Boost.Unordered
open-addressing containers offer a very limited API for accessing buckets.
[caption=, title='Table {counter:table-counter}. Methods for Accessing Buckets']
[cols="1,.^1", frame=all, grid=rows]
|===
|Method |Description
2+^h| *All containers*
h|*Method* h|*Description*
|`size_type bucket_count() const`
|The number of buckets.
2+^h| *Closed-addressing containers only* +
`boost::unordered_[multi]set`, `boost::unordered_[multi]map`
h|*Method* h|*Description*
|`size_type max_bucket_count() const`
|An upper bound on the number of buckets.
|`size_type bucket_size(size_type n) const`
|The number of elements in bucket `n`.
@ -69,14 +83,14 @@ keep collisions to a minimum.
== Controlling the number of buckets
As more elements are added to an unordered associative container, the number
of elements in the buckets will increase causing performance to degrade.
of collisions will increase causing performance to degrade.
To combat this the containers increase the bucket count as elements are inserted.
You can also tell the container to change the bucket count (if required) by
calling `rehash`.
The standard leaves a lot of freedom to the implementer to decide how the
number of buckets is chosen, but it does make some requirements based on the
container's 'load factor', the average number of elements per bucket.
container's 'load factor', the number of elements divided by the number of buckets.
Containers also have a 'maximum load factor' which they should try to keep the
load factor below.
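The definition above can be checked directly. The following is a minimal sketch that recomputes the load factor from `size()` and `bucket_count()`; it is written against `std::unordered_set` as a stand-in, and `computed_load_factor` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Load factor as defined in the text: number of elements divided by
// the number of buckets.
template <class Container>
float computed_load_factor(const Container& c) {
    assert(c.bucket_count() > 0); // avoid dividing by zero
    return static_cast<float>(c.size()) /
           static_cast<float>(c.bucket_count());
}
```

The value returned should agree with the container's own `load_factor()` member, up to floating-point rounding.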
@ -97,7 +111,8 @@ or close to the hint - unless your hint is unreasonably small or large.
[caption=, title='Table {counter:table-counter}. Methods for Controlling Bucket Size']
[cols="1,.^1", frame=all, grid=rows]
|===
|Method |Description
2+^h| *All containers*
h|*Method* h|*Description*
|`X(size_type n)`
|Construct an empty container with at least `n` buckets (`X` is the container type).
@ -112,22 +127,45 @@ or close to the hint - unless your hint is unreasonably small or large.
|Returns the current maximum load factor.
|`float max_load_factor(float z)`
|Changes the container's maximum load factor, using `z` as a hint.
|Changes the container's maximum load factor, using `z` as a hint. +
**Open-addressing containers:** this function does nothing: users are not allowed to change the maximum load factor.
|`void rehash(size_type n)`
|Changes the number of buckets so that there are at least `n` buckets, and so that the load factor is less than the maximum load factor.
2+^h| *Open-addressing containers only* +
`boost::unordered_flat_set`, `boost::unordered_flat_map`
h|*Method* h|*Description*
|`size_type max_load() const`
|Returns the maximum number of allowed elements in the container before rehash.
|===
A note on `max_load` for open-addressing containers: the maximum load will naturally decrease when
new insertions are performed, but _won't_ increase at the same rate when erasing: for instance,
adding 1,000 elements to a <<unordered_flat_map,`boost::unordered_flat_map`>> and then
erasing those 1,000 elements will typically reduce the maximum load by around 160 rather
than restoring it to its original value. This is done internally by Boost.Unordered in order
to keep its performance stable, and must be taken into account when planning for rehash-free insertions.
The maximum load will be reset to its theoretical maximum
(`max_load_factor() * bucket_count()`) right after `rehash`.
== Iterator Invalidation
It is not specified how member functions other than `rehash` and `reserve` affect
the bucket count, although `insert` is only allowed to invalidate iterators
when the insertion causes the load factor to be greater than or equal to the
maximum load factor. For most implementations this means that `insert` will only
change the number of buckets when this happens. While iterators can be
invalidated by calls to `insert`, `rehash` and `reserve`, pointers and references to the
container's elements are never invalidated.
the bucket count, although `insert` can only invalidate iterators
when the insertion causes the container's load to be greater than the maximum allowed.
For most implementations this means that `insert` will only
change the number of buckets when this happens. Iterators can be
invalidated by calls to `insert`, `rehash` and `reserve`.
As for pointers and references,
they are never invalidated for closed-addressing containers (`boost::unordered_[multi]set`, `boost::unordered_[multi]map`),
but they are invalidated when rehashing occurs for open-addressing
`boost::unordered_flat_set` and `boost::unordered_flat_map`: this is because
these containers store elements directly into their holding buckets, so
when allocating a new bucket array the elements must be transferred by means of move construction.
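The pointer-stability guarantee for closed-addressing containers can be sketched as follows. The example uses `std::unordered_map` as a stand-in for `boost::unordered_map`; `pointer_survives_rehash` is a hypothetical helper name. By the description above, the same check is not expected to hold for `boost::unordered_flat_map`, where elements are moved to the new bucket array.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Returns true if the address of an element is unchanged after a forced
// rehash of a node-based (closed-addressing) container.
inline bool pointer_survives_rehash() {
    std::unordered_map<int, std::string> m;
    m.emplace(1, "one");
    const std::string* before = &m.at(1);

    const std::size_t requested = m.bucket_count() * 4 + 1;
    m.rehash(requested); // forces a larger bucket array
    assert(m.bucket_count() >= requested);

    const std::string* after = &m.at(1);
    return before == after;
}
```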
In a similar manner to using `reserve` for ``vector``s, it can be a good idea
to call `reserve` before inserting a large number of elements. This will get

@ -25,19 +25,22 @@
|No equivalent. Since the elements aren't ordered `lower_bound` and `upper_bound` would be meaningless.
|`equal_range(k)` returns an empty range at the position that `k` would be inserted if `k` isn't present in the container.
|`equal_range(k)` returns a range at the end of the container if `k` isn't present in the container. It can't return a positioned range as `k` could be inserted into multiple places. To find out the bucket that `k` would be inserted into use `bucket(k)`. But remember that an insert can cause the container to rehash - meaning that the element can be inserted into a different bucket.
|`equal_range(k)` returns a range at the end of the container if `k` isn't present in the container. It can't return a positioned range as `k` could be inserted into multiple places. +
**Closed-addressing containers:** To find out the bucket that `k` would be inserted into use `bucket(k)`. But remember that an insert can cause the container to rehash - meaning that the element can be inserted into a different bucket.
|`iterator`, `const_iterator` are of the bidirectional category.
|`iterator`, `const_iterator` are of at least the forward category.
|Iterators, pointers and references to the container's elements are never invalidated.
|<<buckets_iterator_invalidation,Iterators can be invalidated by calls to insert or rehash>>. Pointers and references to the container's elements are never invalidated.
|<<buckets_iterator_invalidation,Iterators can be invalidated by calls to insert or rehash>>. +
**Closed-addressing containers:** Pointers and references to the container's elements are never invalidated. +
**Open-addressing containers:** Pointers and references to the container's elements are invalidated when rehashing occurs.
|Iterators iterate through the container in the order defined by the comparison object.
|Iterators iterate through the container in an arbitrary order, that can change as elements are inserted, although equivalent elements are always adjacent.
|No equivalent
|Local iterators can be used to iterate through individual buckets. (The order of local iterators and iterators aren't required to have any correspondence.)
|**Closed-addressing containers:** Local iterators can be used to iterate through individual buckets. (The order of local iterators and iterators aren't required to have any correspondence.)
|Can be compared using the `==`, `!=`, `<`, `\<=`, `>`, `>=` operators.
|Can be compared using the `==` and `!=` operators.
@ -45,9 +48,6 @@
|
|When inserting with a hint, implementations are permitted to ignore the hint.
|`erase` never throws an exception
|The containers' hash or predicate function can throw exceptions from `erase`.
|===
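The `equal_range` row of the table can be illustrated with a short sketch. It uses `std::unordered_set` as a stand-in for the Boost containers, and `is_absent` is a hypothetical helper name.

```cpp
#include <cassert>
#include <unordered_set>

// Returns true when the key is not in the container. For an absent key the
// range is empty and sits at end(), not at a "would-be" insertion position.
template <class Set, class Key>
bool is_absent(const Set& s, const Key& k) {
    auto r = s.equal_range(k);
    if (r.first == r.second) {
        assert(r.first == s.end());
        return true;
    }
    return false;
}
```

This contrasts with ordered associative containers, where the empty range returned for a missing key marks the position at which the key would be inserted.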
---

@ -5,13 +5,15 @@
:cpp: C++
== Closed-addressing containers: unordered_[multi]set, unordered_[multi]map
The intent of Boost.Unordered is to provide a close (but imperfect)
implementation of the {cpp}17 standard, that will work with {cpp}98 upwards.
The wide compatibility does mean some compromises have to be made.
With a compiler and library that fully support {cpp}11, the differences should
be minor.
== Move emulation
=== Move emulation
Support for move semantics is implemented using Boost.Move. If rvalue
references are available it will use them, but if not it uses a close,
@ -23,7 +25,7 @@ but imperfect emulation. On such compilers:
* The containers themselves are not movable.
* Argument forwarding is not perfect.
== Use of allocators
=== Use of allocators
{cpp}11 introduced a new allocator system. It's backwards compatible due to
the lax requirements for allocators in the old standard, but might need
@ -56,7 +58,7 @@ Due to imperfect move emulation, some assignments might check
`propagate_on_container_copy_assignment` on some compilers and
`propagate_on_container_move_assignment` on others.
== Construction/Destruction using allocators
=== Construction/Destruction using allocators
The following support is required for full use of {cpp}11 style
construction/destruction:
@ -76,7 +78,7 @@ constructing a `std::pair` using `boost::tuple` (see <<compliance_pairs,below>>)
When support is not available `allocator_traits::construct` and
`allocator_traits::destroy` are never called.
== Pointer Traits
=== Pointer Traits
`pointer_traits` aren't used. Instead, pointer types are obtained from
rebound allocators, this can cause problems if the allocator can't be
@ -84,7 +86,7 @@ used with incomplete types. If `const_pointer` is not defined in the
allocator, `boost::pointer_to_other<pointer, const value_type>::type`
is used to obtain a const pointer.
== Pairs
=== Pairs
Since the containers use `std::pair` they're limited to the version
from the current standard library. But since {cpp}11 ``std::pair``'s
@ -105,7 +107,7 @@ Older drafts of the standard also supported variadic constructors
for `std::pair`, where the first argument would be used for the
first part of the pair, and the remaining for the second part.
== Miscellaneous
=== Miscellaneous
When swapping, `Pred` and `Hash` are not currently swapped by calling
`swap`, their copy constructors are used. As a consequence when swapping
@ -114,3 +116,28 @@ an exception may be thrown from their copy constructor.
Variadic constructor arguments for `emplace` are only used when both
rvalue references and variadic template parameters are available.
Otherwise `emplace` can only take up to 10 constructor arguments.
== Open-addressing containers: unordered_flat_set, unordered_flat_map
The C++ standard does not currently provide any open-addressing container
specification to adhere to, so `boost::unordered_flat_set` and
`boost::unordered_flat_map` take inspiration from `std::unordered_set` and
`std::unordered_map`, respectively, and depart from their interface where
convenient or as dictated by their internal data structure, which is
radically different from that imposed by the standard (closed addressing, node based).
`unordered_flat_set` and `unordered_flat_map` only work with reasonably
compliant C++11 (or later) compilers. Language-level features such as move semantics
and variadic template parameters are then not emulated.
`unordered_flat_set` and `unordered_flat_map` are fully https://en.cppreference.com/w/cpp/named_req/AllocatorAwareContainer[AllocatorAware^].
The main differences with C++ unordered associative containers are:
* `value_type` must be move-constructible.
* Pointer stability is not kept under rehashing.
* `begin()` is not constant-time.
* `erase(iterator)` returns `void` instead of an iterator to the following element.
* There is no API for bucket handling (except `bucket_count`) or node extraction/insertion.
* The maximum load factor of the container is managed internally and can't be set by the user. The maximum load,
exposed through the public function `max_load`, may not increase monotonically with the number of erasures.
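Since `erase(iterator)` returns `void` in these containers, loops that rely on the returned iterator need a different shape. One possible sketch is shown below; it advances the iterator before erasing. It is tested here only against `std::unordered_set`, and it assumes that erasing an element does not invalidate iterators to other elements, which should be confirmed for the open-addressing containers before relying on it. `erase_matching` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical helper: erases every element satisfying pred without using
// the return value of erase(iterator); returns how many were erased.
template <class Container, class Pred>
std::size_t erase_matching(Container& c, Pred pred) {
    const std::size_t initial = c.size();
    std::size_t erased = 0;
    for (auto it = c.begin(); it != c.end();) {
        if (pred(*it)) {
            c.erase(it++); // advance first, then erase the old position
            ++erased;
        } else {
            ++it;
        }
    }
    assert(c.size() + erased == initial);
    return erased;
}
```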

@ -11,4 +11,8 @@ Copyright (C) 2005-2008 Daniel James
Copyright (C) 2022 Christian Mazakas
Copyright (C) 2022 Joaqu&iacute;n M L&oacute;pez Mu&ntilde;oz
Copyright (C) 2022 Peter Dimov
Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

@ -29,14 +29,14 @@ struct hash_is_avalanching;
A hash function is said to have the _avalanching property_ if small changes in the input translate to
large changes in the returned hash code &#8212;ideally, flipping one bit in the representation of
the input value results in each bit of the hash code flipping with probability 50%. This property is
critical for the proper behavior of open-addressing hash containers.
the input value results in each bit of the hash code flipping with probability 50%. Approaching
this property is critical for the proper behavior of open-addressing hash containers.
`hash_is_avalanching<Hash>` derives from `std::true_type` if `Hash::is_avalanching` is a valid type,
and derives from `std::false_type` otherwise.
`hash_is_avalanching<Hash>::value` is `true` if `Hash::is_avalanching` is a valid type,
and `false` otherwise.
Users can then declare a hash function `Hash` as avalanching either by embedding an `is_avalanching` typedef
into the definition of `Hash`, or directly by specializing `hash_is_avalanching<Hash>` to derive from
`std::true_type`.
into the definition of `Hash`, or directly by specializing `hash_is_avalanching<Hash>` to a class with
an embedded compile-time constant `value` set to `true`.
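The typedef-based opt-in can be sketched as follows. `detects_avalanching` is a hypothetical stand-in that mimics the detection rule described above (it is not the library's trait), and `mixing_hash` and `identity_hash` are illustrative user-defined hash functions. The mixer body is the well-known MurmurHash3 64-bit finalizer, used here only as an example of a hash with good bit diffusion.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// C++11-compatible helper mapping any valid type list to void.
template <class...>
struct make_void { typedef void type; };

// Hypothetical stand-in for the detection rule: true when
// Hash::is_avalanching names a valid type, false otherwise.
template <class Hash, class = void>
struct detects_avalanching : std::false_type {};

template <class Hash>
struct detects_avalanching<
    Hash, typename make_void<typename Hash::is_avalanching>::type>
    : std::true_type {};

// A hash that opts in by embedding the typedef.
struct mixing_hash {
    typedef void is_avalanching;
    std::size_t operator()(std::uint64_t k) const {
        k ^= k >> 33;
        k *= 0xff51afd7ed558ccdULL;
        k ^= k >> 33;
        k *= 0xc4ceb9fe1a85ec53ULL;
        k ^= k >> 33;
        return static_cast<std::size_t>(k);
    }
};

// A hash without the typedef: not marked as avalanching.
struct identity_hash {
    std::size_t operator()(std::uint64_t k) const {
        return static_cast<std::size_t>(k);
    }
};

static_assert(detects_avalanching<mixing_hash>::value, "typedef present");
static_assert(!detects_avalanching<identity_hash>::value, "typedef absent");
```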
xref:unordered_flat_set[`boost::unordered_flat_set`] and xref:unordered_flat_map[`boost::unordered_flat_map`]
use the provided hash function `Hash` as-is if `hash_is_avalanching<Hash>::value` is `true`; otherwise, they

@ -18,12 +18,12 @@ or isn't practical. In contrast, a hash table only needs an equality function
and a hash function for the key.
With this in mind, unordered associative containers were added to the {cpp}
standard. This is an implementation of the containers described in {cpp}11,
standard. Boost.Unordered provides an implementation of the containers described in {cpp}11,
with some <<compliance,deviations from the standard>> in
order to work with non-{cpp}11 compilers and libraries.
`unordered_set` and `unordered_multiset` are defined in the header
`<boost/unordered_set.hpp>`
`<boost/unordered/unordered_set.hpp>`
[source,c++]
----
namespace boost {
@ -44,7 +44,7 @@ namespace boost {
----
`unordered_map` and `unordered_multimap` are defined in the header
`<boost/unordered_map.hpp>`
`<boost/unordered/unordered_map.hpp>`
[source,c++]
----
@ -65,10 +65,51 @@ namespace boost {
}
----
When using Boost.TR1, these classes are included from `<unordered_set>` and
`<unordered_map>`, with the classes added to the `std::tr1` namespace.
These containers, and all other implementations of standard unordered associative
containers, use an approach to their internal data structure design called
*closed addressing*. Starting in Boost 1.81, Boost.Unordered also provides containers
`boost::unordered_flat_set` and `boost::unordered_flat_map`, which use a
different data structure strategy commonly known as *open addressing* and depart in
a small number of ways from the standard so as to offer much better performance
in exchange (more than 2 times faster in typical scenarios):
The containers are used in a similar manner to the normal associative
[source,c++]
----
// #include <boost/unordered/unordered_flat_set.hpp>
//
// Note: no multiset version
namespace boost {
template <
class Key,
class Hash = boost::hash<Key>,
class Pred = std::equal_to<Key>,
class Alloc = std::allocator<Key> >
class unordered_flat_set;
}
----
[source,c++]
----
// #include <boost/unordered/unordered_flat_map.hpp>
//
// Note: no multimap version
namespace boost {
template <
class Key, class Mapped,
class Hash = boost::hash<Key>,
class Pred = std::equal_to<Key>,
class Alloc = std::allocator<std::pair<Key const, Mapped> > >
class unordered_flat_map;
}
----
`boost::unordered_flat_set` and `boost::unordered_flat_map` require a
reasonably compliant C++11 compiler.
Boost.Unordered containers are used in a similar manner to the normal associative
containers:
[source,cpp]
@ -87,7 +128,7 @@ But since the elements aren't ordered, the output of:
[source,c++]
----
BOOST_FOREACH(map::value_type i, x) {
for(const map::value_type& i: x) {
std::cout<<i.first<<","<<i.second<<"\n";
}
----

@ -4,15 +4,17 @@
= Implementation Rationale
The intent of this library is to implement the unordered
containers in the standard, so the interface was fixed. But there are
== boost::unordered_[multi]set and boost::unordered_[multi]map
These containers adhere to the standard requirements for unordered associative
containers, so the interface was fixed. But there are
still some implementation decisions to make. The priorities are
conformance to the standard and portability.
The http://en.wikipedia.org/wiki/Hash_table[Wikipedia article on hash tables^]
has a good summary of the implementation issues for hash tables in general.
== Data Structure
=== Data Structure
By specifying an interface for accessing the buckets of the container the
standard pretty much requires that the hash table uses chained addressing.
@ -37,7 +39,7 @@ bucket but there are some serious problems with this:
So chained addressing is used.
== Number of Buckets
=== Number of Buckets
There are two popular methods for choosing the number of buckets in a hash
table. One is to have a prime number of buckets, another is to use a power
@ -70,3 +72,44 @@ distribution.
Since release 1.80.0, prime numbers are chosen for the number of buckets in
tandem with sophisticated modulo arithmetic. This removes the need for "mixing"
the result of the user's hash function as was used for release 1.79.0.
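To see why the choice of bucket count interacts with the quality of the hash function, the sketch below counts how many distinct positions a regularly spaced sequence of hash values reaches. It uses plain `%` for the reduction, not the optimized modulo arithmetic mentioned above, and `distinct_positions` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct positions hit by the hash values 0, stride, 2*stride, ...
// when position = hash % bucket_count.
inline std::size_t distinct_positions(std::size_t bucket_count,
                                      std::size_t stride,
                                      std::size_t samples) {
    assert(bucket_count > 0);
    std::set<std::size_t> seen;
    for (std::size_t i = 0; i < samples; ++i) {
        seen.insert((i * stride) % bucket_count);
    }
    return seen.size();
}
```

With hash values that are all multiples of 8, a bucket count of 8 sends every value to a single position, while a bucket count of 7 spreads them over all 7 positions. This is the kind of pattern that unmixed hash values can expose when the bucket count is a power of two.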
== boost::unordered_flat_set and boost::unordered_flat_map
The C++ standard specification of unordered associative containers imposes
severe limitations on permissible implementations, the most important being
that closed addressing is implicitly assumed. Slightly relaxing this specification
opens up the possibility of providing container variations taking full
advantage of open-addressing techniques.
The design of `boost::unordered_flat_set` and `boost::unordered_flat_map` has been
guided by Peter Dimov's https://pdimov.github.io/articles/unordered_dev_plan.html[Development Plan for Boost.Unordered^].
We discuss here the most relevant principles.
=== Hash function
Given its rich functionality and cross-platform interoperability,
`boost::hash` remains the default hash function of `boost::unordered_flat_set` and `boost::unordered_flat_map`.
As it happens, `boost::hash` for integral and other basic types does not provide
the good statistical properties required by open addressing; to cope with this,
we implement a post-mixing stage:
* 64-bit architectures: we use the `xmx` function defined in
Jon Maiga's http://jonkagstrom.com/bit-mixer-construction/index.html[The construct of a bit mixer^].
* 32-bit architectures: the mixer used was selected from a set generated with https://github.com/skeeto/hash-prospector[Hash Function Prospector^]
as the best overall performer in our internal benchmarks. Score assigned by Hash Prospector is 333.7934929677524.
When using a hash function directly suitable for open addressing, post-mixing can be opted out of via a dedicated <<hash_traits_hash_is_avalanching,`hash_is_avalanching`>> trait.
`boost::hash` specializations for string types are marked as avalanching.
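For orientation, an xmx-style mixer (xorshift, multiply, xorshift) has roughly the following shape. This is a sketch only: the shift amount and the multiplier below are assumptions for illustration and should be checked against the cited article and the library source, and `xmx_sketch` is a hypothetical name.

```cpp
#include <cstdint>

// Sketch of a 64-bit xorshift-multiply-xorshift mixer.
// Assumed constants: shift of 23 and an odd 64-bit multiplier.
inline std::uint64_t xmx_sketch(std::uint64_t x) {
    x ^= x >> 23;               // fold high bits into low bits
    x *= 0xff51afd7ed558ccdULL; // odd multiplier, so the step is invertible
    x ^= x >> 23;               // fold again after the multiplication
    return x;
}
```

Each step is invertible on 64-bit values, so distinct inputs always produce distinct outputs; the purpose of the mixer is to spread input differences across all bits, not to reduce collisions in the hash values themselves.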
=== Platform interoperability
The observable behavior of `boost::unordered_flat_set` and `boost::unordered_flat_map` is deterministically
identical across different compilers as long as their ``std::size_t``s are the same size and the user-provided
hash function and equality predicate are also interoperable
&#8212;this includes elements being ordered in exactly the same way for the same sequence of
operations.
Although the implementation internally uses SIMD technologies, such as https://en.wikipedia.org/wiki/SSE2[SSE2^]
and https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(NEON)[Neon^], when available,
this does not affect interoperability. For instance, the behavior is the same
for Visual Studio on an Intel CPU with SSE2 in x64 and for GCC on an IBM s390x without any supported SIMD technology.