mirror of https://github.com/boostorg/unordered.git
synced 2025-07-30 03:17:15 +02:00

Commit: refactored to modernize and improve flow
@@ -13,9 +13,10 @@
include::unordered/intro.adoc[]
include::unordered/buckets.adoc[]
include::unordered/hash_equality.adoc[]
include::unordered/regular.adoc[]
include::unordered/concurrent.adoc[]
include::unordered/compliance.adoc[]
include::unordered/structures.adoc[]
include::unordered/benchmarks.adoc[]
include::unordered/rationale.adoc[]
include::unordered/ref.adoc[]
@@ -2,9 +2,9 @@
:idprefix: buckets_
:imagesdir: ../diagrams

= Basics of Hash Tables

The containers are made up of a number of _buckets_, each of which can contain
any number of elements. For example, the following diagram shows a <<unordered_set,`boost::unordered_set`>> with 7 buckets containing 5 elements, `A`,
`B`, `C`, `D` and `E` (this is just for illustration, containers will typically
have more buckets).

@@ -12,8 +12,7 @@ have more buckets).
image::buckets.png[]

In order to decide which bucket to place an element in, the container applies
the hash function, `Hash`, to the element's key (for sets the key is the whole element, but is referred to as the key
so that the same terminology can be used for sets and maps). This returns a
value of type `std::size_t`. `std::size_t` has a much greater range of values
than the number of buckets, so the container applies another transformation to
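Conceptually, the key-to-bucket mapping can be pictured as follows. This is a minimal sketch only: `bucket_for` is a hypothetical helper, and the containers' real transformation is more elaborate than a plain modulo.

[source,c++]
----
#include <boost/container_hash/hash.hpp>
#include <cstddef>
#include <string>

// Hypothetical helper illustrating the two steps: hash the key, then
// reduce the std::size_t hash value to a bucket index.
std::size_t bucket_for(const std::string& key, std::size_t bucket_count)
{
  std::size_t h = boost::hash<std::string>()(key); // Hash applied to the key
  return h % bucket_count;                         // map onto the bucket range
}
----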
@@ -80,7 +79,7 @@ h|*Method* h|*Description*
|===

== Controlling the Number of Buckets

As more elements are added to an unordered associative container, the number
of collisions will increase causing performance to degrade.
@@ -90,8 +89,8 @@ calling `rehash`.
The standard leaves a lot of freedom to the implementer to decide how the
number of buckets is chosen, but it does make some requirements based on the
container's _load factor_, the number of elements divided by the number of buckets.
Containers also have a _maximum load factor_, below which they should try to keep the
load factor.
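In terms of the container interface, the quantities above can be read directly (a minimal sketch):

[source,c++]
----
#include <boost/unordered_set.hpp>
#include <iostream>

int main()
{
  boost::unordered_set<int> s = {1, 2, 3, 4, 5};
  // load factor == number of elements divided by the number of buckets
  std::cout << s.load_factor() << " == "
            << float(s.size()) / s.bucket_count() << "\n";
  // the container tries to keep load_factor() below max_load_factor()
  std::cout << s.max_load_factor() << "\n";
}
----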

You can't control the bucket count directly but there are two ways to
@@ -133,9 +132,10 @@ h|*Method* h|*Description*
|`void rehash(size_type n)`
|Changes the number of buckets so that there are at least `n` buckets, and so that the load factor is less than the maximum load factor.

2+^h| *Open-addressing and concurrent containers only* +
`boost::unordered_flat_set`, `boost::unordered_flat_map` +
`boost::unordered_node_set`, `boost::unordered_node_map` +
`boost::concurrent_flat_map`

h|*Method* h|*Description*

|`size_type max_load() const`
@@ -143,7 +143,7 @@ h|*Method* h|*Description*
|===

A note on `max_load` for open-addressing and concurrent containers: the maximum load will be
(`max_load_factor() * bucket_count()`) right after `rehash` or on container creation, but may
slightly decrease when erasing elements in high-load situations. For instance, if we
have a <<unordered_flat_map,`boost::unordered_flat_map`>> with `size()` almost

@@ -151,216 +151,4 @@ at `max_load()` level and then erase 1,000 elements, `max_load()` may decrease by a
few dozen elements. This is done internally by Boost.Unordered in order
to keep its performance stable, and must be taken into account when planning for rehash-free insertions.
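For example, a bulk insertion can be planned like this (a minimal sketch; as noted above, the exact `max_load()` value is implementation-dependent):

[source,c++]
----
#include <boost/unordered/unordered_flat_map.hpp>
#include <cstddef>

int main()
{
  boost::unordered_flat_map<int, int> m;
  m.reserve(1000000);                // allocate the bucket array up front
  std::size_t budget = m.max_load(); // elements insertable without rehashing
  for (std::size_t i = 0; i < budget; ++i)
    m.emplace(int(i), 0);            // stays rehash-free
}
----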
@@ -6,8 +6,9 @@
:github-pr-url: https://github.com/boostorg/unordered/pull
:cpp: C++

== Release 1.83.0 - Major update

* Added `boost::concurrent_flat_map`, a fast, thread-safe hashmap based on open addressing.
* Sped up iteration of open-addressing containers.

== Release 1.82.0 - Major update
@@ -5,7 +5,7 @@

:cpp: C++

== Closed-addressing Containers

`unordered_[multi]set` and `unordered_[multi]map` are intended to provide a conformant
implementation of the {cpp}20 standard that will work with {cpp}98 upwards.

@@ -13,7 +13,7 @@ This wide compatibility does mean some compromises have to be made.
With a compiler and library that fully support {cpp}11, the differences should
be minor.

=== Move Emulation

Support for move semantics is implemented using Boost.Move. If rvalue
references are available it will use them, but if not it uses a close,

@@ -25,7 +25,7 @@ but imperfect emulation. On such compilers:
* The containers themselves are not movable.
* Argument forwarding is not perfect.

=== Use of Allocators

{cpp}11 introduced a new allocator system. It's backwards compatible due to
the lax requirements for allocators in the old standard, but might need

@@ -58,7 +58,7 @@ Due to imperfect move emulation, some assignments might check
`propagate_on_container_copy_assignment` on some compilers and
`propagate_on_container_move_assignment` on others.

=== Construction/Destruction Using Allocators

The following support is required for full use of {cpp}11 style
construction/destruction:

@@ -117,7 +117,7 @@ Variadic constructor arguments for `emplace` are only used when both
rvalue references and variadic template parameters are available.
Otherwise `emplace` can only take up to 10 constructor arguments.

== Open-addressing Containers

The C++ standard does not currently provide any open-addressing container
specification to adhere to, so `boost::unordered_flat_set`/`unordered_node_set` and

@@ -144,7 +144,7 @@ The main differences with C++ unordered associative containers are:
** Pointer stability is not kept under rehashing.
** There is no API for node extraction/insertion.

== Concurrent Containers

There is currently no specification in the C++ standard for this or any other concurrent
data structure. `boost::concurrent_flat_map` takes the same template parameters as `std::unordered_map`
@@ -1,8 +1,9 @@
[#concurrent]
= Concurrent Containers

:idprefix: concurrent_

Boost.Unordered currently provides just one concurrent container named `boost::concurrent_flat_map`.
`boost::concurrent_flat_map` is a hash table that allows concurrent write/read access from
different threads without having to implement any synchronization mechanism on the user's side.

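For example (a minimal sketch; several threads may run `work` concurrently on the same map without any external locking):

[source,c++]
----
#include <boost/unordered/concurrent_flat_map.hpp>
#include <string>

void work(boost::concurrent_flat_map<std::string, int>& m)
{
  m.emplace("alpha", 1);                            // thread-safe insertion
  m.visit("alpha", [](auto& kv) { ++kv.second; });  // thread-safe element access
}
----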
@@ -131,7 +132,7 @@ by using `cvisit` overloads (for instance, `insert_or_cvisit`) and may result
in higher parallelization. Consult the xref:#concurrent_flat_map[reference]
for a complete list of available operations.

== Whole-table Visitation

In the absence of iterators, `boost::concurrent_flat_map` provides `visit_all`
as an alternative way to process all the elements in the map:
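A minimal sketch of such whole-table processing:

[source,c++]
----
#include <boost/unordered/concurrent_flat_map.hpp>
#include <string>

void double_all(boost::concurrent_flat_map<std::string, int>& m)
{
  // visit_all processes every element; cvisit_all provides read-only access
  m.visit_all([](auto& kv) { kv.second *= 2; });
}
----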
@@ -168,7 +169,7 @@ may be inserted, modified or erased by other threads during visitation. It is
advisable not to assume too much about the exact global state of a `boost::concurrent_flat_map`
at any point in your program.

== Blocking Operations

``boost::concurrent_flat_map``s can be copied, assigned, cleared and merged just like any
Boost.Unordered container. Unlike most other operations, these are _blocking_,

@@ -177,5 +178,5 @@ clear or merge operation is in progress. Blocking is taken care of automatically
and the user need not take any special precaution, but overall performance may be affected.

Another blocking operation is _rehashing_, which happens explicitly via `rehash`/`reserve`
or during insertion when the table's load hits `max_load()`. As with non-concurrent containers,
reserving space in advance of bulk insertions will generally speed up the process.
@@ -4,146 +4,22 @@
:idprefix: intro_
:cpp: C++

link:https://en.wikipedia.org/wiki/Hash_table[Hash tables^] are extremely popular
computer data structures and can be found in one form or another in virtually any programming
language. Whereas other associative structures such as rb-trees (used in {cpp} by `std::set` and `std::map`)
have logarithmic-time complexity for insertion and lookup, hash tables, if configured properly,
perform these operations in constant time on average, and are generally much faster.

{cpp} introduced __unordered associative containers__ `std::unordered_set`, `std::unordered_map`,
`std::unordered_multiset` and `std::unordered_multimap` in {cpp}11, but research on hash tables
hasn't stopped since: advances in CPU architectures such as
more powerful caches, link:https://en.wikipedia.org/wiki/Single_instruction,_multiple_data[SIMD] operations
and increasingly available link:https://en.wikipedia.org/wiki/Multi-core_processor[multicore processors]
open up possibilities for improved hash-based data structures and new use cases that
are simply beyond reach of unordered associative containers as specified in 2011.

Boost.Unordered offers a catalog of hash containers with different standards compliance levels,
performance characteristics and intended usage scenarios:

[caption=, title='Table {counter:table-counter}. Boost.Unordered containers']
[cols="1,1,.^1", frame=all, grid=rows]

@@ -165,44 +41,49 @@ These are all the containers provided by Boost.Unordered:
^| `boost::unordered_flat_set` +
`boost::unordered_flat_map`

^.^h|*Concurrent*
^|
^| `boost::concurrent_flat_map`

|===

* **Closed-addressing containers** are fully compliant with the C++ specification
for unordered associative containers and feature one of the fastest implementations
in the market within the technical constraints imposed by the required standard interface.
* **Open-addressing containers** rely on much faster data structures and algorithms
(more than 2 times faster in typical scenarios) while slightly diverging from the standard
interface to accommodate the implementation.
There are two variants: **flat** (the fastest) and **node-based**, which
provide pointer stability under rehashing at the expense of being slower.
* Finally, `boost::concurrent_flat_map` (the only **concurrent container** provided
at present) is a hashmap designed and implemented to be used in high-performance
multithreaded scenarios. Its interface is radically different from that of regular C++ containers.

All sets and maps in Boost.Unordered are instantiated similarly to
`std::unordered_set` and `std::unordered_map`, respectively:

[source,c++]
----
namespace boost {
  template <
    class Key,
    class Hash = boost::hash<Key>,
    class Pred = std::equal_to<Key>,
    class Alloc = std::allocator<Key> >
  class unordered_set;
  // same for unordered_multiset, unordered_flat_set, unordered_node_set

  template <
    class Key, class Mapped,
    class Hash = boost::hash<Key>,
    class Pred = std::equal_to<Key>,
    class Alloc = std::allocator<std::pair<Key const, Mapped> > >
  class unordered_map;
  // same for unordered_multimap, unordered_flat_map, unordered_node_map
  // and concurrent_flat_map
}
----

To store an object in an unordered associative container requires both a
key equality function and a hash function. The default function objects in
the standard containers support a few basic types including integer types,

@@ -213,16 +94,3 @@ you have to extend Boost.Hash to support the type or use
your own custom equality predicates and hash functions. See the
<<hash_equality,Equality Predicates and Hash Functions>> section
for more details.
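For instance, a user-defined type becomes usable as a key once it has an equality operator and Boost.Hash support, the latter typically added through an ADL-visible `hash_value` overload (a minimal sketch):

[source,c++]
----
#include <boost/container_hash/hash.hpp>
#include <boost/unordered_set.hpp>
#include <cstddef>

struct point
{
  int x, y;
};

bool operator==(point const& a, point const& b) { return a.x == b.x && a.y == b.y; }

// found by boost::hash<point> via argument-dependent lookup
std::size_t hash_value(point const& p)
{
  std::size_t seed = 0;
  boost::hash_combine(seed, p.x);
  boost::hash_combine(seed, p.y);
  return seed;
}

boost::unordered_set<point> points; // uses boost::hash<point> and operator==
----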
@@ -4,7 +4,7 @@

= Implementation Rationale

== Closed-addressing Containers

`boost::unordered_[multi]set` and `boost::unordered_[multi]map`
adhere to the standard requirements for unordered associative

@@ -74,7 +74,7 @@ Since release 1.80.0, prime numbers are chosen for the number of buckets in
tandem with sophisticated modulo arithmetic. This removes the need for "mixing"
the result of the user's hash function as was used for release 1.79.0.

== Open-addressing Containers

The C++ standard specification of unordered associative containers imposes
severe limitations on permissible implementations, the most important being

@@ -86,7 +86,7 @@ The design of `boost::unordered_flat_set`/`unordered_node_set` and `boost::unord
guided by Peter Dimov's https://pdimov.github.io/articles/unordered_dev_plan.html[Development Plan for Boost.Unordered^].
We discuss here the most relevant principles.

=== Hash Function

Given its rich functionality and cross-platform interoperability,
`boost::hash` remains the default hash function of open-addressing containers.

@@ -105,7 +105,7 @@ whereas in 32 bits _C_ = 0xE817FB2Du has been obtained from https://arxiv.org/ab
When using a hash function directly suitable for open addressing, post-mixing can be opted out via a dedicated <<hash_traits_hash_is_avalanching,`hash_is_avalanching`>> trait.
`boost::hash` specializations for string types are marked as avalanching.
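For example, a user-provided hash function can advertise that its output is already well mixed by exposing an `is_avalanching` member type, which the `hash_is_avalanching` trait detects (a minimal sketch; the hash body shown is a placeholder, not a recommended function):

[source,c++]
----
#include <cstddef>
#include <string_view>

struct my_string_hash
{
  using is_avalanching = void; // detected by hash_is_avalanching: no post-mixing applied

  std::size_t operator()(std::string_view s) const noexcept
  {
    std::size_t h = 0;               // placeholder body, for illustration only
    for (unsigned char c: s) h = h * 131 + c;
    return h;
  }
};
----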

=== Platform Interoperability

The observable behavior of `boost::unordered_flat_set`/`unordered_node_set` and `boost::unordered_flat_map`/`unordered_node_map` is deterministically
identical across different compilers as long as their ``std::size_t``s are the same size and the user-provided

@@ -118,7 +118,7 @@ and https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(NEON)[N
this does not affect interoperability. For instance, the behavior is the same
for Visual Studio on an x64-mode Intel CPU with SSE2 and for GCC on an IBM s390x without any supported SIMD technology.

== Concurrent Containers

The same data structure used by Boost.Unordered open-addressing containers has been chosen
also as the foundation of `boost::concurrent_flat_map`:

@@ -132,7 +132,7 @@ lookup that are lock-free up to the last step of actual element comparison.
of all elements between `boost::concurrent_flat_map` and `boost::unordered_flat_map`.
(This feature has not been implemented yet.)

=== Hash Function and Platform Interoperability

`boost::concurrent_flat_map` makes the same decisions and provides the same guarantees
as Boost.Unordered open-addressing containers with regards to
@@ -1,8 +1,99 @@
[#regular]
= Regular Containers

:idprefix: regular_

Boost.Unordered closed-addressing containers (`boost::unordered_set`, `boost::unordered_map`,
`boost::unordered_multiset` and `boost::unordered_multimap`) are fully conformant with the
C++ specification for unordered associative containers, so for those who know how to use
`std::unordered_set`, `std::unordered_map`, etc., their homonyms in Boost.Unordered are
drop-in replacements. The interface of open-addressing containers (`boost::unordered_node_set`,
`boost::unordered_node_map`, `boost::unordered_flat_set` and `boost::unordered_flat_map`)
is very similar, but they present some minor differences listed in the dedicated
xref:#compliance_open_addressing_containers[standard compliance section].

For readers without previous experience with hash containers but familiar
with normal associative containers (`std::set`, `std::map`,
`std::multiset` and `std::multimap`), Boost.Unordered containers are used in a similar manner:

[source,cpp]
----
typedef boost::unordered_map<std::string, int> map;
map x;
x["one"] = 1;
x["two"] = 2;
x["three"] = 3;

assert(x.at("one") == 1);
assert(x.find("missing") == x.end());
----

But since the elements aren't ordered, the output of:

[source,c++]
----
for(const map::value_type& i: x) {
  std::cout<<i.first<<","<<i.second<<"\n";
}
----

can be in any order. For example, it might be:

[source]
----
two,2
one,1
three,3
----

There are other differences, which are listed in the
<<comparison,Comparison with Associative Containers>> section.

== Iterator Invalidation

It is not specified how member functions other than `rehash` and `reserve` affect
the bucket count, although `insert` can only invalidate iterators
when the insertion causes the container's load to be greater than the maximum allowed.
For most implementations this means that `insert` will only
change the number of buckets when this happens. Iterators can be
invalidated by calls to `insert`, `rehash` and `reserve`.

As for pointers and references,
they are never invalidated for node-based containers
(`boost::unordered_[multi]set`, `boost::unordered_[multi]map`, `boost::unordered_node_set`, `boost::unordered_node_map`),
but they will be when rehashing occurs for
`boost::unordered_flat_set` and `boost::unordered_flat_map`: this is because
these containers store elements directly into their holding buckets, so
when allocating a new bucket array the elements must be transferred by means of move construction.
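A minimal sketch contrasting the two behaviors (the `assert` reflects the node-based guarantee described above):

[source,c++]
----
#include <boost/unordered/unordered_flat_map.hpp>
#include <boost/unordered/unordered_node_map.hpp>
#include <cassert>

int main()
{
  boost::unordered_node_map<int, int> nm;
  int* p = &nm[1];
  nm.reserve(100000);        // may rehash, but node-based elements are not moved
  assert(p == &nm.at(1));    // pointers and references remain valid

  boost::unordered_flat_map<int, int> fm;
  int* q = &fm[1];
  fm.reserve(100000);        // rehashing moves elements into a new bucket array
  (void)q;                   // q must not be dereferenced after a rehash
}
----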

In a similar manner to using `reserve` for ``vector``s, it can be a good idea
to call `reserve` before inserting a large number of elements. This will get
the expensive rehashing out of the way and let you store iterators, safe in
the knowledge that they won't be invalidated. If you are inserting `n`
elements into container `x`, you could first call:

```
x.reserve(n);
```

Note:: `reserve(n)` reserves space for at least `n` elements, allocating enough buckets
so as to not exceed the maximum load factor.
+
Because the maximum load factor is defined as the number of elements divided by the total
number of available buckets, this function is logically equivalent to:
+
```
x.rehash(std::ceil(n / x.max_load_factor()))
```
+
See the <<unordered_map_rehash,reference for more details>> on the `rehash` function.

[#comparison]

:idprefix: comparison_

== Comparison with Associative Containers

[caption=, title='Table {counter:table-counter} Interface differences']
[cols="1,1", frame=all, grid=rows]

@@ -32,7 +123,7 @@
|`iterator`, `const_iterator` are of at least the forward category.

|Iterators, pointers and references to the container's elements are never invalidated.
|<<regular_iterator_invalidation,Iterators can be invalidated by calls to insert or rehash>>. +
**Node-based containers:** Pointers and references to the container's elements are never invalidated. +
**Flat containers:** Pointers and references to the container's elements are invalidated when rehashing occurs.

doc/unordered/structures.adoc (new file, 179 lines)
@@ -0,0 +1,179 @@
[#structures]
= Data Structures

:idprefix: structures_

== Closed-addressing Containers

++++
<style>
  .imageblock > .title {
    text-align: inherit;
  }
</style>
++++

Boost.Unordered sports one of the fastest implementations of closed addressing, also commonly known as https://en.wikipedia.org/wiki/Hash_table#Separate_chaining[separate chaining]. An example figure representing the data structure is below:

[#img-bucket-groups,.text-center]
.A simple bucket group approach
image::bucket-groups.png[align=center]

An array of "buckets" is allocated and each bucket in turn points to its own individual linked list. This makes meeting the standard requirements of bucket iteration straightforward. Unfortunately, iteration of the entire container is oftentimes slow using this layout as each bucket must be examined for occupancy, yielding a time complexity of `O(bucket_count() + size())` when the standard requires complexity to be `O(size())`.

Canonical standard implementations will wind up looking like the diagram below:

[.text-center]
.The canonical standard approach
image::singly-linked.png[align=center,link=../diagrams/singly-linked.png,window=_blank]

It's worth noting that this approach is only used by pass:[libc++] and pass:[libstdc++]; the MSVC Dinkumware implementation uses a different one. A more detailed analysis of the standard containers can be found http://bannalia.blogspot.com/2013/10/implementation-of-c-unordered.html[here].

This unusually laid out data structure is chosen to make iteration of the entire container efficient by inter-connecting all of the nodes into a singly-linked list. One might also notice that buckets point to the node _before_ the start of the bucket's elements. This is done so that removing elements from the list can be done efficiently without introducing the need for a doubly-linked list. Unfortunately, this data structure introduces a guaranteed extra indirection. For example, to access the first element of a bucket, something like this must be done:

```c++
auto const idx = get_bucket_idx(hash_function(key));
node* p = buckets[idx]; // first load
node* n = p->next; // second load
if (n && is_in_bucket(n, idx)) {
  value_type const& v = *n; // third load
  // ...
}
```

With a simple bucket group layout, this is all that must be done:

```c++
auto const idx = get_bucket_idx(hash_function(key));
node* n = buckets[idx]; // first load
if (n) {
  value_type const& v = *n; // second load
  // ...
}
```

In practice, the extra indirection can have a dramatic performance impact on common operations such as `insert`, `find` and `erase`. But to keep iteration of the container fast, Boost.Unordered introduces a novel data structure, a "bucket group". A bucket group is a fixed-width view of a subsection of the buckets array. It contains a bitmask (a `std::size_t`) which it uses to track occupancy of buckets and contains two pointers so that it can form a doubly-linked list with non-empty groups. An example diagram is below:

[#img-fca-layout]
.The new layout used by Boost
image::fca.png[align=center]

Thus container-wide iteration is turned into traversing the non-empty bucket groups (an operation with constant time complexity) which reduces the time complexity back to `O(size())`. In total, a bucket group is only 4 words in size and it views `sizeof(std::size_t) * CHAR_BIT` buckets meaning that for all common implementations, there's only 4 bits of space overhead per bucket introduced by the bucket groups.
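In outline, a bucket group can be pictured like this (a simplified sketch; the actual member names and layout in Boost.Unordered differ):

[source,c++]
----
#include <climits>
#include <cstddef>

struct node; // element node, as in the snippets above

struct bucket_group
{
  static constexpr std::size_t N = sizeof(std::size_t) * CHAR_BIT; // buckets viewed per group

  node**        buckets;  // fixed-width view into the bucket array
  std::size_t   bitmask;  // bit i set <=> buckets[i] is occupied
  bucket_group* next;     // doubly-linked list threading together
  bucket_group* prev;     // the non-empty groups only
};
----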

A more detailed description of Boost.Unordered's closed-addressing implementation is
given in an
https://bannalia.blogspot.com/2022/06/advancing-state-of-art-for.html[external article].
For more information on implementation rationale, read the
xref:#rationale_closed_addressing_containers[corresponding section].

== Open-addressing Containers

The diagram shows the basic internal layout of `boost::unordered_flat_map`/`unordered_node_map` and
`boost::unordered_flat_set`/`unordered_node_set`.

[#img-foa-layout]
.Open-addressing layout used by Boost.Unordered.
image::foa.png[align=center]

As with all open-addressing containers, elements (or pointers to the element nodes in the case of
`boost::unordered_node_map` and `boost::unordered_node_set`) are stored directly in the bucket array.
This array is logically divided into 2^_n_^ _groups_ of 15 elements each.
In addition to the bucket array, there is an associated _metadata array_ with 2^_n_^
16-byte words.

[#img-foa-metadata]
.Breakdown of a metadata word.
image::foa-metadata.png[align=center]

A metadata word is divided into 15 _h_~_i_~ bytes (one for each associated
bucket), and an _overflow byte_ (_ofw_ in the diagram). The value of _h_~_i_~ is:

- 0 if the corresponding bucket is empty.
- 1 to encode a special empty bucket called a _sentinel_, which is used internally to
stop iteration when the container has been fully traversed.
- If the bucket is occupied, a _reduced hash value_ obtained from the hash value of
the element.

When looking for an element with hash value _h_, SIMD technologies such as
https://en.wikipedia.org/wiki/SSE2[SSE2] and
https://en.wikipedia.org/wiki/ARM_architecture_family#Advanced_SIMD_(Neon)[Neon] allow us
to very quickly inspect the full metadata word and look for the reduced value of _h_ among all the
15 buckets with just a handful of CPU instructions: non-matching buckets can be
readily discarded, and those whose reduced hash value matches need be inspected via full
comparison with the corresponding element. If the looked-for element is not present,
the overflow byte is inspected:

- If the bit in the position _h_ mod 8 is zero, lookup terminates (and the
element is not present).
- If the bit is set to 1 (the group has been _overflowed_), further groups are
checked using https://en.wikipedia.org/wiki/Quadratic_probing[_quadratic probing_], and
the process is repeated.
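The following sketch emulates the matching step without SIMD (illustrative only; the reduced-hash encoding and the actual code in Boost.Unordered are more involved):

[source,c++]
----
#include <cstddef>
#include <cstdint>

// One metadata word covering a group of 15 buckets (illustrative layout)
struct group_metadata
{
  std::uint8_t h[15];    // 0 = empty, 1 = sentinel, otherwise a reduced hash value
  std::uint8_t overflow; // overflow byte, one bit per hash residue mod 8
};

// Scalar emulation of the SIMD match: bit i of the result is set
// when bucket i holds a matching reduced hash value.
std::uint16_t match(group_metadata const& m, std::uint8_t reduced)
{
  std::uint16_t mask = 0;
  for (int i = 0; i < 15; ++i)
    if (m.h[i] == reduced) mask |= std::uint16_t(1u << i);
  return mask; // each set bit is a candidate for full element comparison
}

// After an unsuccessful match, decides whether further groups must be probed
bool overflowed(group_metadata const& m, std::size_t hash)
{
  return (m.overflow >> (hash % 8)) & 1u;
}
----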

Insertion is algorithmically similar: empty buckets are located using SIMD,
and when going past a full group its corresponding overflow bit is set to 1.

In architectures without SIMD support, the logical layout stays the same, but the metadata
word is codified using a technique we call _bit interleaving_: this layout allows us
to emulate SIMD with reasonably good performance using only standard arithmetic and
logical operations.

[#img-foa-metadata-interleaving]
.Bit-interleaved metadata word.
image::foa-metadata-interleaving.png[align=center]

A more detailed description of Boost.Unordered's open-addressing implementation is
given in an
https://bannalia.blogspot.com/2022/11/inside-boostunorderedflatmap.html[external article].
For more information on implementation rationale, read the
xref:#rationale_open_addresing_containers[corresponding section].

== Concurrent Containers

`boost::concurrent_flat_map` uses the basic
xref:#structures_open_addressing_containers[open-addressing layout] described above
augmented with synchronization mechanisms.

[#img-cfoa-layout]
.Concurrent open-addressing layout used by Boost.Unordered.
image::cfoa.png[align=center]

Two levels of synchronization are used:

* Container level: A read-write mutex is used to control access from any operation
to the container. Typically, such access is in read mode (that is, concurrent) even
for modifying operations, so for most practical purposes there is no thread
contention at this level. Access is only in write mode (blocking) when rehashing or
performing container-wide operations such as swapping or assignment.
* Group level: Each 15-slot group is equipped with an 8-byte word containing:
** A read-write spinlock for synchronized access to any element in the group.
** An atomic _insertion counter_ used for optimistic insertion as described
below.

By using atomic operations to access the group metadata, lookup is (group-level)
lock-free up to the point where an actual comparison needs to be done with an element
that has been previously SIMD-matched: only then is the group's spinlock used.

Insertion uses the following _optimistic algorithm_:

* The value of the insertion counter for the initial group in the probe
sequence is locally recorded (let's call this value `c0`).
* Lookup is as described above. If lookup finds no equivalent element,
search for an available slot for insertion successively locks/unlocks
each group in the probing sequence.
* When an available slot is located, it is preemptively occupied (its
reduced hash value is set) and the insertion counter is atomically
incremented: if no other thread has incremented the counter during the
whole operation (which is checked by comparing with `c0`), then we're
good to go and complete the insertion, otherwise we roll back and start
over.

This algorithm has very low contention both at the lookup and actual
insertion phases in exchange for the possibility that computations have
to be started over if some other thread interferes in the process by
performing a successful insertion beginning at the same group. In
practice, the start-over frequency is extremely small, measured in the range
of parts per million for some of our benchmarks.
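The key ingredient is the per-group insertion counter check; a toy model of just that step is sketched below (illustrative only, not the actual implementation):

[source,c++]
----
#include <atomic>
#include <cstdint>

struct group
{
  std::atomic<std::uint32_t> insertion_counter{0};
};

// c0 is the counter value recorded at the start of the insertion attempt.
// Returns true if the insertion can be committed, false if another thread
// completed an insertion at this group meanwhile and we must start over.
bool try_commit(group& g0, std::uint32_t c0)
{
  return g0.insertion_counter.fetch_add(1, std::memory_order_acq_rel) == c0;
}
----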

For more information on implementation rationale, read the
xref:#rationale_concurrent_containers[corresponding section].