refactored to modernize and improve flow

2023-05-18 20:18:58 +02:00
parent ff10b287e2
commit 3d640ac032
9 changed files with 354 additions and 425 deletions
--- a/doc/unordered.adoc
+++ b/doc/unordered.adoc
@@ -13,9 +13,10 @@
 include::unordered/intro.adoc[]
 include::unordered/buckets.adoc[]
 include::unordered/hash_equality.adoc[]
-include::unordered/comparison.adoc[]
-include::unordered/concurrent_flat_map_intro.adoc[]
+include::unordered/regular.adoc[]
+include::unordered/concurrent.adoc[]
 include::unordered/compliance.adoc[]
+include::unordered/structures.adoc[]
 include::unordered/benchmarks.adoc[]
 include::unordered/rationale.adoc[]
 include::unordered/ref.adoc[]
--- a/doc/unordered/buckets.adoc
+++ b/doc/unordered/buckets.adoc
@@ -2,9 +2,9 @@
 :idprefix: buckets_
 :imagesdir: ../diagrams

-= The Data Structure
+= Basics of Hash Tables

-The containers are made up of a number of 'buckets', each of which can contain
+The containers are made up of a number of _buckets_, each of which can contain
 any number of elements. For example, the following diagram shows a <<unordered_set,`boost::unordered_set`>> with 7 buckets containing 5 elements, `A`,
 `B`, `C`, `D` and `E` (this is just for illustration, containers will typically
 have more buckets).
@@ -12,8 +12,7 @@ have more buckets).
 image::buckets.png[]

 In order to decide which bucket to place an element in, the container applies
-the hash function, `Hash`, to the element's key (for `unordered_set` and
-`unordered_multiset` the key is the whole element, but is referred to as the key
+the hash function, `Hash`, to the element's key (for sets the key is the whole element, but is referred to as the key
 so that the same terminology can be used for sets and maps). This returns a
 value of type `std::size_t`. `std::size_t` has a much greater range of values
 than the number of buckets, so the container applies another transformation to
@@ -80,7 +79,7 @@ h|*Method* h|*Description*

 |===

-== Controlling the number of buckets
+== Controlling the Number of Buckets

 As more elements are added to an unordered associative container, the number
 of collisions will increase causing performance to degrade.
@@ -90,8 +89,8 @@ calling `rehash`.

 The standard leaves a lot of freedom to the implementer to decide how the
 number of buckets is chosen, but it does make some requirements based on the
-container's 'load factor', the number of elements divided by the number of buckets.
-Containers also have a 'maximum load factor' which they should try to keep the
+container's _load factor_, the number of elements divided by the number of buckets.
+Containers also have a _maximum load factor_ which they should try to keep the
 load factor below.

 You can't control the bucket count directly but there are two ways to
@@ -133,9 +132,10 @@ h|*Method* h|*Description*
 |`void rehash(size_type n)`
 |Changes the number of buckets so that there are at least `n` buckets, and so that the load factor is less than the maximum load factor.

-2+^h| *Open-addressing containers only* +
+2+^h| *Open-addressing and concurrent containers only* +
 `boost::unordered_flat_set`, `boost::unordered_flat_map` +
 `boost::unordered_node_set`, `boost::unordered_node_map` +
+`boost::concurrent_flat_map`
 h|*Method* h|*Description*

 |`size_type max_load() const`
@@ -143,7 +143,7 @@ h|*Method* h|*Description*

 |===

-A note on `max_load` for open-addressing containers: the maximum load will be 
+A note on `max_load` for open-addressing and concurrent containers: the maximum load will be 
 (`max_load_factor() * bucket_count()`) right after `rehash` or on container creation, but may
 slightly decrease when erasing elements in high-load situations. For instance, if we
 have a <<unordered_flat_map,`boost::unordered_flat_map`>> with `size()` almost
@@ -151,216 +151,4 @@ at `max_load()` level and then erase 1,000 elements, `max_load()` may decrease b
 few dozen elements. This is done internally by Boost.Unordered in order
 to keep its performance stable, and must be taken into account when planning for rehash-free insertions.

-== Iterator Invalidation

-It is not specified how member functions other than `rehash` and `reserve` affect
-the bucket count, although `insert` can only invalidate iterators
-when the insertion causes the container's load to be greater than the maximum allowed.
-For most implementations this means that `insert` will only
-change the number of buckets when this happens. Iterators can be
-invalidated by calls to `insert`, `rehash` and `reserve`.
-
-As for pointers and references,
-they are never invalidated for node-based containers 
-(`boost::unordered_[multi]set`, `boost::unordered_[multi]map`, `boost::unordered_node_set`, `boost::unordered_node_map`),
-but they will when rehashing occurs for
-`boost::unordered_flat_set` and `boost::unordered_flat_map`: this is because
-these containers store elements directly into their holding buckets, so
-when allocating a new bucket array the elements must be transferred by means of move construction.
-
-In a similar manner to using `reserve` for ``vector``s, it can be a good idea
-to call `reserve` before inserting a large number of elements. This will get
-the expensive rehashing out of the way and let you store iterators, safe in
-the knowledge that they won't be invalidated. If you are inserting `n`
-elements into container `x`, you could first call:
-
-```
-x.reserve(n);
-```
-
-Note:: `reserve(n)` reserves space for at least `n` elements, allocating enough buckets
-so as to not exceed the maximum load factor.
-+
-Because the maximum load factor is defined as the number of elements divided by the total
-number of available buckets, this function is logically equivalent to:
-+
-```
-x.rehash(std::ceil(n / x.max_load_factor()))
-```
-+
-See the <<unordered_map_rehash,reference for more details>> on the `rehash` function.
-
-== Fast Closed Addressing Implementation
-
-++++
-<style>
-  .imageblock > .title {
-    text-align: inherit;
-  }
-</style>
-++++
-
-Boost.Unordered sports one of the fastest implementations of closed addressing, also commonly known as https://en.wikipedia.org/wiki/Hash_table#Separate_chaining[separate chaining]. An example figure representing the data structure is below:
-
-[#img-bucket-groups,.text-center]
-.A simple bucket group approach
-image::bucket-groups.png[align=center]
-
-An array of "buckets" is allocated and each bucket in turn points to its own individual linked list. This makes meeting the standard requirements of bucket iteration straight-forward. Unfortunately, iteration of the entire container is often times slow using this layout as each bucket must be examined for occupancy, yielding a time complexity of `O(bucket_count() + size())` when the standard requires complexity to be `O(size())`.
-
-Canonical standard implementations will wind up looking like the diagram below:
-
-[.text-center]
-.The canonical standard approach
-image::singly-linked.png[align=center,link=../diagrams/singly-linked.png,window=_blank]
-
-It's worth noting that this approach is only used by pass:[libc++] and pass:[libstdc++]; the MSVC Dinkumware implementation uses a different one. A more detailed analysis of the standard containers can be found http://bannalia.blogspot.com/2013/10/implementation-of-c-unordered.html[here].
-
-This unusually laid out data structure is chosen to make iteration of the entire container efficient by inter-connecting all of the nodes into a singly-linked list. One might also notice that buckets point to the node _before_ the start of the bucket's elements. This is done so that removing elements from the list can be done efficiently without introducing the need for a doubly-linked list. Unfortunately, this data structure introduces a guaranteed extra indirection. For example, to access the first element of a bucket, something like this must be done:
-
-```c++
-auto const idx = get_bucket_idx(hash_function(key));
-node* p = buckets[idx]; // first load
-node* n = p->next; // second load
-if (n && is_in_bucket(n, idx)) {
-  value_type const& v = *n; // third load
-  // ...
-}
-```
-
-With a simple bucket group layout, this is all that must be done:
-```c++
-auto const idx = get_bucket_idx(hash_function(key));
-node* n = buckets[idx]; // first load
-if (n) {
-  value_type const& v = *n; // second load
-  // ...
-}
-```
-
-In practice, the extra indirection can have a dramatic performance impact to common operations such as `insert`, `find` and `erase`. But to keep iteration of the container fast, Boost.Unordered introduces a novel data structure, a "bucket group". A bucket group is a fixed-width view of a subsection of the buckets array. It contains a bitmask (a `std::size_t`) which it uses to track occupancy of buckets and contains two pointers so that it can form a doubly-linked list with non-empty groups. An example diagram is below:
-
-[#img-fca-layout]
-.The new layout used by Boost
-image::fca.png[align=center]
-
-Thus container-wide iteration is turned into traversing the non-empty bucket groups (an operation with constant time complexity) which reduces the time complexity back to `O(size())`. In total, a bucket group is only 4 words in size and it views `sizeof(std::size_t) * CHAR_BIT` buckets meaning that for all common implementations, there's only 4 bits of space overhead per bucket introduced by the bucket groups.
-
-A more detailed description of Boost.Unordered's closed-addressing implementation is
-given in an
-https://bannalia.blogspot.com/2022/06/advancing-state-of-art-for.html[external article].
-For more information on implementation rationale, read the
-xref:#rationale_closed_addressing_containers[corresponding section].
-
-== Open Addressing Implementation
-
-The diagram shows the basic internal layout of `boost::unordered_flat_map`/`unordered_node_map` and
-`boost:unordered_flat_set`/`unordered_node_set`.
-
-
-[#img-foa-layout]
-.Open-addressing layout used by Boost.Unordered.
-image::foa.png[align=center]
-
-As with all open-addressing containers, elements (or pointers to the element nodes in the case of
-`boost::unordered_node_map` and `boost::unordered_node_set`) are stored directly in the bucket array.
-This array is logically divided into 2^_n_^ _groups_ of 15 elements each.
-In addition to the bucket array, there is an associated _metadata array_ with 2^_n_^
-16-byte words.
-
-[#img-foa-metadata]
-.Breakdown of a metadata word.
-image::foa-metadata.png[align=center]
-
-A metadata word is divided into 15 _h_~_i_~ bytes (one for each associated
-bucket), and an _overflow byte_ (_ofw_ in the diagram). The value of _h_~_i_~ is:
-
-  - 0 if the corresponding bucket is empty.
-  - 1 to encode a special empty bucket called a _sentinel_, which is used internally to
-  stop iteration when the container has been fully traversed.
-  - If the bucket is occupied, a _reduced hash value_ obtained from the hash value of
-  the element.
-
-When looking for an element with hash value _h_, SIMD technologies such as
-https://en.wikipedia.org/wiki/SSE2[SSE2] and
-https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon)[Neon] allow us
-to very quickly inspect the full metadata word and look for the reduced value of _h_ among all the
-15 buckets with just a handful of CPU instructions: non-matching buckets can be
-readily discarded, and those whose reduced hash value matches need be inspected via full
-comparison with the corresponding element. If the looked-for element is not present,
-the overflow byte is inspected:
-
-- If the bit in the position _h_ mod 8 is zero, lookup terminates (and the
-element is not present).
-- If the bit is set to 1 (the group has been _overflowed_), further groups are
-checked using https://en.wikipedia.org/wiki/Quadratic_probing[_quadratic probing_], and
-the process is repeated.
-
-Insertion is algorithmically similar: empty buckets are located using SIMD,
-and when going past a full group its corresponding overflow bit is set to 1.
-
-In architectures without SIMD support, the logical layout stays the same, but the metadata
-word is codified using a technique we call _bit interleaving_: this layout allows us
-to emulate SIMD with reasonably good performance using only standard arithmetic and
-logical operations.
-
-[#img-foa-metadata-interleaving]
-.Bit-interleaved metadata word.
-image::foa-metadata-interleaving.png[align=center]
-
-A more detailed description of Boost.Unordered's open-addressing implementation is
-given in an
-https://bannalia.blogspot.com/2022/11/inside-boostunorderedflatmap.html[external article].
-For more information on implementation rationale, read the
-xref:#rationale_open_addresing_containers[corresponding section].
-
-== Concurrent Open Addressing Implementation
-
-`boost::concurrent_flat_map` uses the basic
-xref::#buckets_open_addressing_implementation[open-addressing layout] described above
-augmented with synchronization mechanisms.
-
-
-[#img-cfoa-layout]
-.Concurrent open-addressing layout used by Boost.Unordered.
-image::cfoa.png[align=center]
-
-Two levels of synchronization are used:
-
-* Container level: A read-write mutex is used to control access from any operation
-to the container. Typically, such access is in read mode (that is, concurrent) even
-for modifying operations, so for most practical purposes there is no thread
-contention at this level. Access is only in write mode (blocking) when rehashing or
-performing container-wide operations such as swapping or assignment.
-* Group level: Each 15-slot group is equipped with an 8-byte word containing:
-  ** A read-write spinlock for synchronized access to any element in the group.
-  ** An atomic _insertion counter_ used for optimistic insertion as described
-  below.
-
-By using atomic operations to access the group metadata, lookup is (group-level)
-lock-free up to the point where an actual comparison needs to be done with an element
-that has been previously SIMD-matched: only then it's the group's spinlock used.
-
-Insertion uses the following _optimistic algorithm_:
-
-* The value of the insertion counter for the initial group in the probe
-sequence is locally recorded (let's call this value `c0`).
-* Lookup is as described above. If lookup finds no equivalent element,
-search for an available slot for insertion successively locks/unlocks
-each group in the probing sequence.
-* When an available slot is located, it is preemptively occupied (its
-reduced hash value is set) and the insertion counter is atomically
-incremented: if no other thread has incremented the counter during the
-whole operation (which is checked by comparing with `c0`), then we're
-good to go and complete the insertion, otherwise we roll back and start
-over.
-
-This algorithm has very low contention both at the lookup and actual
-insertion phases in exchange for the possibility that computations have
-to be started over if some other thread interferes in the process by
-performing a succesful insertion beginning at the same group. In
-practice, the start-over frequency is extremely small, measured in the range
-of parts per million for some of our benchmarks.
-
-For more information on implementation rationale, read the
-xref:#rationale_concurrent_hashmap[corresponding section].
--- a/doc/unordered/changes.adoc
+++ b/doc/unordered/changes.adoc
@@ -6,8 +6,9 @@
 :github-pr-url: https://github.com/boostorg/unordered/pull
 :cpp: C++

-== Release 1.83.0
+== Release 1.83.0 - Major update

+* Added `boost::concurrent_flat_map`, a fast, thread-safe hashmap based on open addressing.
 * Sped up iteration of open-addressing containers.

 == Release 1.82.0 - Major update
--- a/doc/unordered/compliance.adoc
+++ b/doc/unordered/compliance.adoc
@@ -5,7 +5,7 @@

 :cpp: C++

-== Closed-addressing containers
+== Closed-addressing Containers

 `unordered_[multi]set` and `unordered_[multi]map` are intended to provide a conformant
 implementation of the {cpp}20 standard that will work with {cpp}98 upwards.
@@ -13,7 +13,7 @@ This wide compatibility does mean some compromises have to be made.
 With a compiler and library that fully support {cpp}11, the differences should
 be minor.

-=== Move emulation
+=== Move Emulation

 Support for move semantics is implemented using Boost.Move. If rvalue
 references are available it will use them, but if not it uses a close,
@@ -25,7 +25,7 @@ but imperfect emulation. On such compilers:
 * The containers themselves are not movable.
 * Argument forwarding is not perfect.

-=== Use of allocators
+=== Use of Allocators

 {cpp}11 introduced a new allocator system. It's backwards compatible due to
 the lax requirements for allocators in the old standard, but might need
@@ -58,7 +58,7 @@ Due to imperfect move emulation, some assignments might check
 `propagate_on_container_copy_assignment` on some compilers and
 `propagate_on_container_move_assignment` on others.

-=== Construction/Destruction using allocators
+=== Construction/Destruction Using Allocators

 The following support is required for full use of {cpp}11 style
 construction/destruction:
@@ -117,7 +117,7 @@ Variadic constructor arguments for `emplace` are only used when both
 rvalue references and variadic template parameters are available.
 Otherwise `emplace` can only take up to 10 constructor arguments.

-== Open-addressing containers
+== Open-addressing Containers

 The C++ standard does not currently provide any open-addressing container
 specification to adhere to, so `boost::unordered_flat_set`/`unordered_node_set` and
@@ -144,7 +144,7 @@ The main differences with C++ unordered associative containers are:
  ** Pointer stability is not kept under rehashing.
  ** There is no API for node extraction/insertion.

-== Concurrent Hashmap
+== Concurrent Containers

 There is currently no specification in the C++ standard for this or any other concurrent
 data structure. `boost::concurrent_flat_map` takes the same template parameters as `std::unordered_map`
--- a/doc/unordered/concurrent_flat_map_intro.adoc
+++ b/doc/unordered/concurrent_flat_map_intro.adoc
@@ -1,8 +1,9 @@
-[#concurrent_flat_map_intro]
-= An introduction to boost::concurrent_flat_map
+[#concurrent]
+= Concurrent Containers

-:idprefix: concurrent_flat_map_intro_
+:idprefix: concurrent_

+Boost.Unordered currently provides just one concurrent container named `boost::concurrent_flat_map`.
 `boost::concurrent_flat_map` is a hash table that allows concurrent write/read access from
 different threads without having to implement any synchronization mechanism on the user's side.

@@ -131,7 +132,7 @@ by using `cvisit` overloads (for instance, `insert_or_cvisit`) and may result
 in higher parallelization. Consult the xref:#concurrent_flat_map[reference]
 for a complete list of available operations.

-== Whole-table visitation
+== Whole-table Visitation

 In the absence of iterators, `boost::concurrent_flat_map` provides `visit_all`
 as an alternative way to process all the elements in the map:
@@ -168,7 +169,7 @@ may be inserted, modified or erased by other threads during visitation. It is
 advisable not to assume too much about the exact global state of a `boost::concurrent_flat_map`
 at any point in your program.

-== Blocking operations
+== Blocking Operations

 ``boost::concurrent_flat_map``s can be copied, assigned, cleared and merged just like any
 Boost.Unordered container. Unlike most other operations, these are _blocking_,
@@ -177,5 +178,5 @@ clear or merge operation is in progress. Blocking is taken care of automatically
 and the user need not take any special precaution, but overall performance may be affected.

 Another blocking operation is _rehashing_, which happens explicitly via `rehash`/`reserve`
-or during insertion when the table's load hits `max_load()`. As with non-concurrent hashmaps,
+or during insertion when the table's load hits `max_load()`. As with non-concurrent containers,
 reserving space in advance of bulk insertions will generally speed up the process.
--- a/doc/unordered/intro.adoc
+++ b/doc/unordered/intro.adoc
@@ -4,146 +4,22 @@
 :idprefix: intro_
 :cpp: C++

-For accessing data based on key lookup, the {cpp} standard library offers `std::set`,
-`std::map`, `std::multiset` and `std::multimap`. These are generally
-implemented using balanced binary trees so that lookup time has
-logarithmic complexity. That is generally okay, but in many cases a
-link:https://en.wikipedia.org/wiki/Hash_table[hash table^] can perform better, as accessing data has constant complexity,
-on average. The worst case complexity is linear, but that occurs rarely and
-with some care, can be avoided.
+link:https://en.wikipedia.org/wiki/Hash_table[Hash tables^] are extremely popular
+computer data structures and can be found under one form or another in virtually any programming
+language. Whereas other associative structures such as rb-trees (used in {cpp} by `std::set` and `std::map`)
+have logarithmic-time complexity for insertion and lookup, hash tables, if configured properly,
+perform these operations in constant time on average, and are generally much faster.

-Also, the existing containers require a 'less than' comparison object
-to order their elements. For some data types this is impossible to implement
-or isn't practical. In contrast, a hash table only needs an equality function
-and a hash function for the key.
+{cpp} introduced __unordered associative containers__ `std::unordered_set`, `std::unordered_map`,
+`std::unordered_multiset` and `std::unordered_multimap` in {cpp}11, but research on hash tables
+hasn't stopped since: advances in CPU architectures such as
+more powerful caches, link:https://en.wikipedia.org/wiki/Single_instruction,_multiple_data[SIMD] operations
+and increasingly available link:https://en.wikipedia.org/wiki/Multi-core_processor[multicore processors]
+open up possibilities for improved hash-based data structures and new use cases that
+are simply beyond reach of unordered associative containers as specified in 2011.

-With this in mind, unordered associative containers were added to the {cpp}
-standard. Boost.Unordered provides an implementation of the containers described in {cpp}11,
-with some <<compliance,deviations from the standard>> in
-order to work with non-{cpp}11 compilers and libraries.
-
-`unordered_set` and `unordered_multiset` are defined in the header
-`<boost/unordered/unordered_set.hpp>`
-[source,c++]
----  
-namespace boost {
-    template <
-        class Key,
-        class Hash = boost::hash<Key>,
-        class Pred = std::equal_to<Key>,
-        class Alloc = std::allocator<Key> >
-    class unordered_set;
-
-    template<
-        class Key,
-        class Hash = boost::hash<Key>, 
-        class Pred = std::equal_to<Key>, 
-        class Alloc = std::allocator<Key> > 
-    class unordered_multiset;
-}
----
-
-`unordered_map` and `unordered_multimap` are defined in the header
-`<boost/unordered/unordered_map.hpp>`
-
-[source,c++]
----
-namespace boost {
-    template <
-        class Key, class Mapped,
-        class Hash = boost::hash<Key>,
-        class Pred = std::equal_to<Key>,
-        class Alloc = std::allocator<std::pair<Key const, Mapped> > >
-    class unordered_map;
-
-    template<
-        class Key, class Mapped,
-        class Hash = boost::hash<Key>,
-        class Pred = std::equal_to<Key>,
-        class Alloc = std::allocator<std::pair<Key const, Mapped> > >
-    class unordered_multimap;
-}
----
-
-These containers, and all other implementations of standard unordered associative
-containers, use an approach to its internal data structure design called
-*closed addressing*. Starting in Boost 1.81, Boost.Unordered also provides containers
-`boost::unordered_flat_set` and `boost::unordered_flat_map`, which use a
-different data structure strategy commonly known as *open addressing* and depart in
-a small number of ways from the standard so as to offer much better performance
-in exchange (more than 2 times faster in typical scenarios):
-
-
-[source,c++]
----
-// #include <boost/unordered/unordered_flat_set.hpp>
-//
-// Note: no multiset version
-
-namespace boost {
-    template <
-        class Key,
-        class Hash = boost::hash<Key>,
-        class Pred = std::equal_to<Key>,
-        class Alloc = std::allocator<Key> >
-    class unordered_flat_set;
-}
----
-
-[source,c++]
----
-// #include <boost/unordered/unordered_flat_map.hpp>
-//
-// Note: no multimap version
-
-namespace boost {
-    template <
-        class Key, class Mapped,
-        class Hash = boost::hash<Key>,
-        class Pred = std::equal_to<Key>,
-        class Alloc = std::allocator<std::pair<Key const, Mapped> > >
-    class unordered_flat_map;
-}
----
-
-Starting in Boost 1.82, the containers `boost::unordered_node_set` and `boost::unordered_node_map`
-are introduced: they use open addressing like `boost::unordered_flat_set` and `boost::unordered_flat_map`,
-but internally store element _nodes_, like `boost::unordered_set` and `boost::unordered_map`,
-which provide stability of pointers and references to the elements:
-
-[source,c++]
----
-// #include <boost/unordered/unordered_node_set.hpp>
-//
-// Note: no multiset version
-
-namespace boost {
-    template <
-        class Key,
-        class Hash = boost::hash<Key>,
-        class Pred = std::equal_to<Key>,
-        class Alloc = std::allocator<Key> >
-    class unordered_node_set;
-}
----
-
-[source,c++]
----
-// #include <boost/unordered/unordered_node_map.hpp>
-//
-// Note: no multimap version
-
-namespace boost {
-    template <
-        class Key, class Mapped,
-        class Hash = boost::hash<Key>,
-        class Pred = std::equal_to<Key>,
-        class Alloc = std::allocator<std::pair<Key const, Mapped> > >
-    class unordered_node_map;
-}
----
-
-These are all the containers provided by Boost.Unordered:
+Boost.Unordered offers a catalog of hash containers with different standards compliance levels,
+performance and intended usage scenarios:

 [caption=, title='Table {counter:table-counter}. Boost.Unordered containers']
 [cols="1,1,.^1", frame=all, grid=rows]
@@ -165,44 +41,49 @@ These are all the containers provided by Boost.Unordered:
 ^| `boost::unordered_flat_set` +
 `boost::unordered_flat_map`

+^.^h|*Concurrent*
+^|
+^| `boost::concurrent_flat_map`
+
 |===

-Closed-addressing containers are pass:[C++]98-compatible. Open-addressing containers require a
-reasonably compliant pass:[C++]11 compiler.
+* **Closed-addressing containers** are fully compliant with the C++ specification
+for unordered associative containers and feature one of the fastest implementations
+in the market within the technical constraints imposed by the required standard interface.
+* **Open-addressing containers** rely on much faster data structures and algorithms
+(more than 2 times faster in typical scenarios) while slightly diverging from the standard
+interface to accommodate the implementation.
+There are two variants: **flat** (the fastest) and **node-based**, which
+provides pointer stability under rehashing at the expense of being slower.
+* Finally, `boost::concurrent_flat_map` (the only **concurrent container** provided
+at present) is a hashmap designed and implemented to be used in high-performance
+multithreaded scenarios. Its interface is radically different from that of regular C++ containers.

-Boost.Unordered containers are used in a similar manner to the normal associative
-containers:
-
-[source,cpp]
----
-typedef boost::unordered_map<std::string, int> map;
-map x;
-x["one"] = 1;
-x["two"] = 2;
-x["three"] = 3;
-
-assert(x.at("one") == 1);
-assert(x.find("missing") == x.end());
----
-
-But since the elements aren't ordered, the output of:
+All sets and maps in Boost.Unordered are instantiated similarly to
+`std::unordered_set` and `std::unordered_map`, respectively:

 [source,c++]
----
-for(const map::value_type& i: x) {
-    std::cout<<i.first<<","<<i.second<<"\n";
+----  
+namespace boost {
+    template <
+        class Key,
+        class Hash = boost::hash<Key>,
+        class Pred = std::equal_to<Key>,
+        class Alloc = std::allocator<Key> >
+    class unordered_set; 
+    // same for unordered_multiset, unordered_flat_set, unordered_node_set
+
+    template <
+        class Key, class Mapped,
+        class Hash = boost::hash<Key>,
+        class Pred = std::equal_to<Key>,
+        class Alloc = std::allocator<std::pair<Key const, Mapped> > >
+    class unordered_map;
+    // same for unordered_multimap, unordered_flat_map, unordered_node_map
+    // and concurrent_flat_map
 }
 ----

-can be in any order. For example, it might be:
-
-[source]
----
-two,2
-one,1
-three,3
----
-
 To store an object in an unordered associative container requires both a
 key equality function and a hash function. The default function objects in
 the standard containers support a few basic types including integer types,
@@ -213,16 +94,3 @@ you have to extend Boost.Hash to support the type or use
 your own custom equality predicates and hash functions. See the
 <<hash_equality,Equality Predicates and Hash Functions>> section
 for more details.
-
-There are other differences, which are listed in the
-<<comparison,Comparison with Associative Containers>> section.
-
-== A concurrent hashmap
-
-Starting in Boost 1.83, Boost.Unordered provides `boost::concurrent_flat_map`,
-a thread-safe hash table for high performance multithreaded scenarios. Although
-it shares the internal data structure and most of the algorithms with Boost.Unordered
-open-addressing `boost::unordered_flat_map`, ``boost::concurrent_flat_map``'s API departs significantly
-from that of C++ unordered associative containers to make this table suitable for
-concurrent usage. Consult the xref:#concurrent_flat_map_intro[dedicated tutorial]
-for more information.
--- a/doc/unordered/rationale.adoc
+++ b/doc/unordered/rationale.adoc
@@ -4,7 +4,7 @@

 = Implementation Rationale

-== Closed-addressing containers 
+== Closed-addressing Containers

 `boost::unordered_[multi]set` and `boost::unordered_[multi]map`
 adhere to the standard requirements for unordered associative
@@ -74,7 +74,7 @@ Since release 1.80.0, prime numbers are chosen for the number of buckets in
 tandem with sophisticated modulo arithmetic. This removes the need for "mixing"
 the result of the user's hash function as was used for release 1.79.0.

-== Open-addresing containers 
+== Open-addresing Containers 

 The C++ standard specification of unordered associative containers imposes
 severe limitations on permissible implementations, the most important being
@@ -86,7 +86,7 @@ The design of `boost::unordered_flat_set`/`unordered_node_set` and `boost::unord
 guided by Peter Dimov's https://pdimov.github.io/articles/unordered_dev_plan.html[Development Plan for Boost.Unordered^].
 We discuss here the most relevant principles.

-=== Hash function
+=== Hash Function

 Given its rich functionality and cross-platform interoperability,
 `boost::hash` remains the default hash function of open-addressing containers.
@@ -105,7 +105,7 @@ whereas in 32 bits _C_ = 0xE817FB2Du has been obtained from https://arxiv.org/ab
 When using a hash function directly suitable for open addressing, post-mixing can be opted out of via a dedicated <<hash_traits_hash_is_avalanching,`hash_is_avalanching`>> trait.
 `boost::hash` specializations for string types are marked as avalanching.

-=== Platform interoperability
+=== Platform Interoperability

 The observable behavior of `boost::unordered_flat_set`/`unordered_node_set` and `boost::unordered_flat_map`/`unordered_node_map` is deterministically
 identical across different compilers as long as their ``std::size_t``s are the same size and the user-provided
@@ -118,7 +118,7 @@ and https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(NEON)[N
 this does not affect interoperability. For instance, the behavior is the same
 for Visual Studio on an x64-mode Intel CPU with SSE2 and for GCC on an IBM s390x without any supported SIMD technology.

-== Concurrent Hashmap
+== Concurrent Containers

 The same data structure used by Boost.Unordered open-addressing containers has been chosen
 also as the foundation of `boost::concurrent_flat_map`:
@ -132,7 +132,7 @@ lookup that are lock-free up to the last step of actual element comparison.
 of all elements between `boost::concurrent_flat_map` and `boost::unordered_flat_map`.
 (This feature has not been implemented yet.)

-=== Hash function and platform interoperability
+=== Hash Function and Platform Interoperability

 `boost::concurrent_flat_map` makes the same decisions and provides the same guarantees
 as Boost.Unordered open-addressing containers with regards to 
--- a/doc/unordered/comparison.adoc
+++ b/doc/unordered/comparison.adoc
@ -1,8 +1,99 @@
+[#regular]
+= Regular Containers
+
+:idprefix: regular_
+
+Boost.Unordered closed-addressing containers (`boost::unordered_set`, `boost::unordered_map`,
+`boost::unordered_multiset` and `boost::unordered_multimap`) are fully conformant with the
+C++ specification for unordered associative containers, so for those who know how to use 
+`std::unordered_set`, `std::unordered_map`, etc., their homonyms in Boost.Unordered are
+drop-in replacements. The interface of open-addressing containers (`boost::unordered_node_set`, 
+`boost::unordered_node_map`, `boost::unordered_flat_set` and `boost::unordered_flat_map`)
+is very similar, but they present some minor differences listed in the dedicated
+xref:#compliance_open_addressing_containers[standard compliance section].
+
+
+For readers without previous experience with hash containers but familiar
+with normal associative containers (`std::set`, `std::map`,
+`std::multiset` and `std::multimap`), Boost.Unordered containers are used in a similar manner:
+
+[source,cpp]
+----
+typedef boost::unordered_map<std::string, int> map;
+map x;
+x["one"] = 1;
+x["two"] = 2;
+x["three"] = 3;
+
+assert(x.at("one") == 1);
+assert(x.find("missing") == x.end());
+----
+
+But since the elements aren't ordered, the output of:
+
+[source,c++]
+----
+for(const map::value_type& i: x) {
+    std::cout<<i.first<<","<<i.second<<"\n";
+}
+----
+
+can be in any order. For example, it might be:
+
+[source]
+----
+two,2
+one,1
+three,3
+----
+
+There are other differences, which are listed in the
+<<comparison,Comparison with Associative Containers>> section.
+
+== Iterator Invalidation
+
+It is not specified how member functions other than `rehash` and `reserve` affect
+the bucket count, although `insert` can only invalidate iterators
+when the insertion causes the container's load to be greater than the maximum allowed.
+For most implementations this means that `insert` will only
+change the number of buckets when this happens. Iterators can be
+invalidated by calls to `insert`, `rehash` and `reserve`.
+
+As for pointers and references,
+they are never invalidated for node-based containers 
+(`boost::unordered_[multi]set`, `boost::unordered_[multi]map`, `boost::unordered_node_set`, `boost::unordered_node_map`),
+but they are invalidated when rehashing occurs for
+`boost::unordered_flat_set` and `boost::unordered_flat_map`: this is because
+these containers store elements directly in their holding buckets, so
+when allocating a new bucket array the elements must be transferred by means of move construction.
+
+In a similar manner to using `reserve` for ``vector``s, it can be a good idea
+to call `reserve` before inserting a large number of elements. This will get
+the expensive rehashing out of the way and let you store iterators, safe in
+the knowledge that they won't be invalidated. If you are inserting `n`
+elements into container `x`, you could first call:
+
+```
+x.reserve(n);
+```
+
+Note:: `reserve(n)` reserves space for at least `n` elements, allocating enough buckets
+so as to not exceed the maximum load factor.
+
+Because the load factor is defined as the number of elements divided by the total
+number of available buckets, and it must not exceed the maximum load factor, this function is logically equivalent to:
+
+```
+x.rehash(std::ceil(n / x.max_load_factor()));
+```
+
+See the <<unordered_map_rehash,reference for more details>> on the `rehash` function.
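
The relationship between `reserve` and `rehash` can be checked with a small sketch. To stay self-contained it uses `std::unordered_map`, whose `reserve`/`rehash` semantics the closed-addressing Boost containers follow. `reserve_then_fill` is a hypothetical helper name introduced only for this illustration, and the "bucket count does not change" check reflects the behavior of common implementations rather than a literal guarantee of the standard.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <unordered_map>

// Hypothetical helper: reserves room for n elements, inserts them, and
// reports whether the bucket count stayed the same throughout (no rehash).
inline bool reserve_then_fill(std::size_t n) {
  std::unordered_map<std::size_t, int> x;
  x.reserve(n);

  // reserve(n) must provide at least ceil(n / max_load_factor()) buckets.
  const auto min_buckets =
      static_cast<std::size_t>(std::ceil(n / x.max_load_factor()));
  assert(x.bucket_count() >= min_buckets);
  (void)min_buckets;  // silences "unused" warnings when NDEBUG is defined

  const std::size_t buckets_before = x.bucket_count();
  for (std::size_t i = 0; i < n; ++i) {
    x.emplace(i, 0);  // none of these insertions should trigger a rehash
  }
  return x.size() == n && x.bucket_count() == buckets_before;
}
```

Calling `reserve_then_fill(1000)` is expected to return `true`: all the rehashing cost is paid once, up front, and iterators obtained during the insertions remain usable.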
+
 [#comparison]

 :idprefix: comparison_

-= Comparison with Associative Containers
+== Comparison with Associative Containers

 [caption=, title='Table {counter:table-counter} Interface differences']
 [cols="1,1", frame=all, grid=rows]
@ -32,7 +123,7 @@
 |`iterator`, `const_iterator` are of at least the forward category.

 |Iterators, pointers and references to the container's elements are never invalidated.
-|<<buckets_iterator_invalidation,Iterators can be invalidated by calls to insert or rehash>>. +
+|<<regular_iterator_invalidation,Iterators can be invalidated by calls to insert or rehash>>. +
 **Node-based containers:** Pointers and references to the container's elements are never invalidated. +
 **Flat containers:** Pointers and references to the container's elements are invalidated when rehashing occurs.

--- a/doc/unordered/structures.adoc
+++ b/doc/unordered/structures.adoc
@ -0,0 +1,179 @@
+[#structures]
+= Data Structures
+
+:idprefix: structures_
+
+== Closed-addressing Containers
+
++++
+<style>
+  .imageblock > .title {
+    text-align: inherit;
+  }
+</style>
++++
+
+Boost.Unordered sports one of the fastest implementations of closed addressing, also commonly known as https://en.wikipedia.org/wiki/Hash_table#Separate_chaining[separate chaining]. An example figure representing the data structure is below:
+
+[#img-bucket-groups,.text-center]
+.A simple bucket group approach
+image::bucket-groups.png[align=center]
+
+An array of "buckets" is allocated and each bucket in turn points to its own individual linked list. This makes meeting the standard requirements of bucket iteration straightforward. Unfortunately, iteration of the entire container is often slow with this layout, as each bucket must be examined for occupancy, yielding a time complexity of `O(bucket_count() + size())` when the standard requires complexity to be `O(size())`.
+
+Canonical standard implementations will wind up looking like the diagram below:
+
+[.text-center]
+.The canonical standard approach
+image::singly-linked.png[align=center,link=../diagrams/singly-linked.png,window=_blank]
+
+It's worth noting that this approach is only used by pass:[libc++] and pass:[libstdc++]; the MSVC Dinkumware implementation uses a different one. A more detailed analysis of the standard containers can be found http://bannalia.blogspot.com/2013/10/implementation-of-c-unordered.html[here].
+
+This unusually laid out data structure is chosen to make iteration of the entire container efficient by inter-connecting all of the nodes into a singly-linked list. One might also notice that buckets point to the node _before_ the start of the bucket's elements. This is done so that removing elements from the list can be done efficiently without introducing the need for a doubly-linked list. Unfortunately, this data structure introduces a guaranteed extra indirection. For example, to access the first element of a bucket, something like this must be done:
+
+```c++
+auto const idx = get_bucket_idx(hash_function(key));
+node* p = buckets[idx]; // first load
+node* n = p->next; // second load
+if (n && is_in_bucket(n, idx)) {
+  value_type const& v = *n; // third load
+  // ...
+}
+```
+
+With a simple bucket group layout, this is all that must be done:
+```c++
+auto const idx = get_bucket_idx(hash_function(key));
+node* n = buckets[idx]; // first load
+if (n) {
+  value_type const& v = *n; // second load
+  // ...
+}
+```
+
+In practice, the extra indirection can have a dramatic performance impact on common operations such as `insert`, `find` and `erase`. But to keep iteration of the container fast, Boost.Unordered introduces a novel data structure, a "bucket group". A bucket group is a fixed-width view of a subsection of the buckets array. It contains a bitmask (a `std::size_t`) which it uses to track occupancy of buckets and contains two pointers so that it can form a doubly-linked list with non-empty groups. An example diagram is below:
+
+[#img-fca-layout]
+.The new layout used by Boost
+image::fca.png[align=center]
+
+Thus container-wide iteration is turned into traversing the non-empty bucket groups (an operation with constant time complexity), which reduces the time complexity back to `O(size())`. In total, a bucket group is only 4 words in size and it views `sizeof(std::size_t) * CHAR_BIT` buckets, meaning that for all common implementations there are only 4 bits of space overhead per bucket introduced by the bucket groups.
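
The space arithmetic and the role of the bitmask can be made concrete with a minimal sketch. The `bucket_group` struct below is a hypothetical stand-in, not the library's actual type: the text only states that a group holds a bitmask and two pointers within 4 words, so the fourth member (a pointer to the viewed slice of buckets) is an assumption, as are the helper names.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a bucket group: four words in total.
struct bucket_group {
  void** buckets;       // assumed: start of the viewed slice of the bucket array
  std::size_t bitmask;  // bit i set <=> bucket i of the slice is non-empty
  bucket_group* next;   // links to the neighbouring non-empty groups
  bucket_group* prev;
};

// Number of buckets viewed by one group.
constexpr std::size_t buckets_per_group = sizeof(std::size_t) * CHAR_BIT;

// Space overhead, in bits, that the groups add per bucket.
constexpr std::size_t overhead_bits_per_bucket() {
  return sizeof(bucket_group) * CHAR_BIT / buckets_per_group;
}

// Lists the occupied bucket positions of a group by scanning its bitmask,
// so the empty buckets themselves never have to be loaded. A production
// implementation would use a count-trailing-zeros instruction instead of
// the inner loop.
inline std::vector<std::size_t> occupied_positions(std::size_t bitmask) {
  std::vector<std::size_t> positions;
  while (bitmask != 0) {
    std::size_t pos = 0;
    while (((bitmask >> pos) & 1u) == 0) ++pos;  // index of the lowest set bit
    positions.push_back(pos);
    bitmask &= bitmask - 1;                      // clear the lowest set bit
  }
  return positions;
}
```

On platforms where pointers and `std::size_t` have the same width, `overhead_bits_per_bucket()` evaluates to 4, matching the figure quoted above.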
+
+A more detailed description of Boost.Unordered's closed-addressing implementation is
+given in an
+https://bannalia.blogspot.com/2022/06/advancing-state-of-art-for.html[external article].
+For more information on implementation rationale, read the
+xref:#rationale_closed_addressing_containers[corresponding section].
+
+== Open-addressing Containers
+
+The diagram shows the basic internal layout of `boost::unordered_flat_map`/`unordered_node_map` and
+`boost::unordered_flat_set`/`unordered_node_set`.
+
+
+[#img-foa-layout]
+.Open-addressing layout used by Boost.Unordered.
+image::foa.png[align=center]
+
+As with all open-addressing containers, elements (or pointers to the element nodes in the case of
+`boost::unordered_node_map` and `boost::unordered_node_set`) are stored directly in the bucket array.
+This array is logically divided into 2^_n_^ _groups_ of 15 elements each.
+In addition to the bucket array, there is an associated _metadata array_ with 2^_n_^
+16-byte words.
+
+[#img-foa-metadata]
+.Breakdown of a metadata word.
+image::foa-metadata.png[align=center]
+
+A metadata word is divided into 15 _h_~_i_~ bytes (one for each associated
+bucket), and an _overflow byte_ (_ofw_ in the diagram). The value of _h_~_i_~ is:
+
+  - 0 if the corresponding bucket is empty.
+  - 1 to encode a special empty bucket called a _sentinel_, which is used internally to
+  stop iteration when the container has been fully traversed.
+  - If the bucket is occupied, a _reduced hash value_ obtained from the hash value of
+  the element.
+
+When looking for an element with hash value _h_, SIMD technologies such as
+https://en.wikipedia.org/wiki/SSE2[SSE2] and
+https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon)[Neon] allow us
+to very quickly inspect the full metadata word and look for the reduced value of _h_ among all the
+15 buckets with just a handful of CPU instructions: non-matching buckets can be
+readily discarded, and those whose reduced hash value matches need be inspected via full
+comparison with the corresponding element. If the looked-for element is not present,
+the overflow byte is inspected:
+
+- If the bit in the position _h_ mod 8 is zero, lookup terminates (and the
+element is not present).
+- If the bit is set to 1 (the group has been _overflowed_), further groups are
+checked using https://en.wikipedia.org/wiki/Quadratic_probing[_quadratic probing_], and
+the process is repeated.
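
The lookup steps above can be sketched in scalar code. This is only an illustration under stated assumptions: `metadata_word` and its member names are hypothetical, the byte order inside the word is chosen arbitrarily, and `reduced_hash` uses a made-up reduction (low byte of the hash, remapped away from the reserved values 0 and 1) rather than the library's actual one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of one 16-byte metadata word:
// bytes 0..14 hold h_i for the 15 buckets, byte 15 is the overflow byte.
struct metadata_word {
  std::array<std::uint8_t, 16> bytes{};  // all zero: 15 empty buckets, no overflow

  // Assumed reduction: low byte of the hash, remapped so that the
  // reserved values 0 (empty) and 1 (sentinel) are never produced.
  static std::uint8_t reduced_hash(std::size_t hash) {
    const auto low = static_cast<std::uint8_t>(hash & 0xFF);
    return low < 2 ? static_cast<std::uint8_t>(low + 2) : low;
  }

  void occupy(std::size_t pos, std::size_t hash) {
    assert(pos < 15);
    bytes[pos] = reduced_hash(hash);
  }

  // Bit i of the result is set when bucket i holds the same reduced hash.
  // SIMD performs these 15 byte comparisons at once; this loop is the
  // scalar equivalent. Each set bit still requires a full key comparison.
  unsigned match(std::size_t hash) const {
    const std::uint8_t h = reduced_hash(hash);
    unsigned mask = 0;
    for (std::size_t i = 0; i < 15; ++i) {
      if (bytes[i] == h) mask |= 1u << i;
    }
    return mask;
  }

  // Called by insertion when it goes past this (full) group.
  void mark_overflow(std::size_t hash) {
    bytes[15] = static_cast<std::uint8_t>(bytes[15] | (1u << (hash % 8)));
  }

  // False means that lookup for this hash can stop at this group.
  bool is_overflowed(std::size_t hash) const {
    return (bytes[15] & (1u << (hash % 8))) != 0;
  }
};
```

Two hashes that share the same reduced value produce the same match mask, which is why candidates must still be confirmed by comparing the stored element. Likewise, hashes that agree modulo 8 share an overflow bit, so a set bit only means that probing must continue, not that the element exists.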
+
+Insertion is algorithmically similar: empty buckets are located using SIMD,
+and when going past a full group its corresponding overflow bit is set to 1.
+
+In architectures without SIMD support, the logical layout stays the same, but the metadata
+word is encoded using a technique we call _bit interleaving_: this layout allows us
+to emulate SIMD with reasonably good performance using only standard arithmetic and
+logical operations.
+
+[#img-foa-metadata-interleaving]
+.Bit-interleaved metadata word.
+image::foa-metadata-interleaving.png[align=center]
+
+A more detailed description of Boost.Unordered's open-addressing implementation is
+given in an
+https://bannalia.blogspot.com/2022/11/inside-boostunorderedflatmap.html[external article].
+For more information on implementation rationale, read the
+xref:#rationale_open_addresing_containers[corresponding section].
+
+== Concurrent Containers
+
+`boost::concurrent_flat_map` uses the basic
+xref:#structures_open_addressing_containers[open-addressing layout] described above
+augmented with synchronization mechanisms.
+
+
+[#img-cfoa-layout]
+.Concurrent open-addressing layout used by Boost.Unordered.
+image::cfoa.png[align=center]
+
+Two levels of synchronization are used:
+
+* Container level: A read-write mutex is used to control access from any operation
+to the container. Typically, such access is in read mode (that is, concurrent) even
+for modifying operations, so for most practical purposes there is no thread
+contention at this level. Access is only in write mode (blocking) when rehashing or
+performing container-wide operations such as swapping or assignment.
+* Group level: Each 15-slot group is equipped with an 8-byte word containing:
+  ** A read-write spinlock for synchronized access to any element in the group.
+  ** An atomic _insertion counter_ used for optimistic insertion as described
+  below.
+
+By using atomic operations to access the group metadata, lookup is (group-level)
+lock-free up to the point where an actual comparison needs to be done with an element
+that has been previously SIMD-matched: only then is the group's spinlock used.
+
+Insertion uses the following _optimistic algorithm_:
+
+* The value of the insertion counter for the initial group in the probe
+sequence is locally recorded (let's call this value `c0`).
+* Lookup is as described above. If lookup finds no equivalent element,
+search for an available slot for insertion successively locks/unlocks
+each group in the probing sequence.
+* When an available slot is located, it is preemptively occupied (its
+reduced hash value is set) and the insertion counter is atomically
+incremented: if no other thread has incremented the counter during the
+whole operation (which is checked by comparing with `c0`), then we're
+good to go and complete the insertion, otherwise we roll back and start
+over.
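
The shape of this optimistic algorithm can be sketched with a deliberately simplified model. Everything here is an assumption made for illustration: `toy_group` is a hypothetical class with a single 15-slot group and no probing, a `std::mutex` stands in for the group's read-write spinlock, and lookup takes the lock outright instead of being lock-free on the metadata.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical single-group model: 15 slots, one lock, one insertion counter.
class toy_group {
public:
  bool contains(int key) const {
    std::lock_guard<std::mutex> guard(lock_);
    for (std::size_t i = 0; i < keys_.size(); ++i) {
      if (used_[i] && keys_[i] == key) return true;
    }
    return false;
  }

  // Returns true if key was inserted, false if an equal key is already
  // present or the group is full (a real container would probe further).
  bool insert(int key) {
    for (;;) {
      // Step 1: record the insertion counter (c0).
      const std::uint32_t c0 = counter_.load(std::memory_order_acquire);

      // Step 2: lookup; give up if an equivalent element exists.
      if (contains(key)) return false;

      // Step 3: lock the group and look for an available slot.
      std::lock_guard<std::mutex> guard(lock_);
      std::size_t pos = 0;
      while (pos < used_.size() && used_[pos]) ++pos;
      if (pos == used_.size()) return false;

      // Step 4: occupy the slot preemptively, then bump the counter.
      used_[pos] = true;
      keys_[pos] = key;
      if (counter_.fetch_add(1, std::memory_order_acq_rel) == c0) {
        return true;  // nobody else inserted since c0 was read
      }
      used_[pos] = false;  // another thread interfered: roll back, start over
    }
  }

private:
  mutable std::mutex lock_;                // stand-in for the group spinlock
  std::atomic<std::uint32_t> counter_{0};  // insertion counter
  std::array<int, 15> keys_{};
  std::array<bool, 15> used_{};
};
```

If two threads race to insert the same key, both may pass the lookup, but only the first one to increment the counter sees the value it recorded. The other rolls back, retries, and then finds the key during its second lookup, so no duplicate is stored.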
+
+This algorithm has very low contention both at the lookup and actual
+insertion phases in exchange for the possibility that computations have
+to be started over if some other thread interferes in the process by
+performing a successful insertion beginning at the same group. In
+practice, the start-over frequency is extremely small, measured in the range
+of parts per million for some of our benchmarks.
+
+For more information on implementation rationale, read the
+xref:#rationale_concurrent_containers[corresponding section].