diff --git a/doc/buckets.qbk b/doc/buckets.qbk new file mode 100644 index 00000000..7fe0f95f --- /dev/null +++ b/doc/buckets.qbk @@ -0,0 +1,144 @@ +[section:buckets The Data Structure] + +The containers are made up of a number of 'buckets', each of which can contain +any number of elements. For example, the following +diagram shows an [classref boost::unordered_set unordered_set] with 7 +buckets containing 5 elements, `A`, `B`, `C`, `D` and `E` +(this is just for illustration; the containers have more buckets, even when +empty). + +[$../diagrams/buckets.png] + +In order to decide which bucket to place an element in, the container +applies `Hash` to the element (for maps it applies it to the element's `Key` +part). This gives a `std::size_t`. `std::size_t` has a much greater range of +values than the number of buckets, so the container applies another +transformation to that value to choose a bucket (in the case of +[classref boost::unordered_set] this is just the hash value modulo the number of +buckets). + +If the container later wants to find an element, +it just has to apply the same process to the element (or key for maps) to +discover which bucket to find it in. This means that you only have to look at +the elements within a bucket when searching, and if the hash function has +worked well and evenly distributed the elements among the buckets, this should +be a small number. + +You can see in the diagram that `A` & `D` have been placed in the same bucket. +This means that when looking in this bucket, up to 2 comparisons have to be +made, making searching slower. This is known as a collision. To keep things +fast we try to keep these to a minimum. + +[table Methods for Accessing Buckets + [[Method] [Description]] + + [ + [``size_type bucket_count() const``] + [The number of buckets.] + ] + [ + [``size_type max_bucket_count() const``] + [An upper bound on the number of buckets.] 
+ ] + [ + [``size_type bucket_size(size_type n) const``] + [The number of elements in bucket `n`.] + ] + [ + [`` + local_iterator begin(size_type n); + local_iterator end(size_type n); + const_local_iterator begin(size_type n) const; + const_local_iterator end(size_type n) const; + ``] + [Return begin and end iterators for bucket `n`.] + ] +] + +[h2 Controlling the number of buckets] + +As more elements are added to an unordered associative container, the number +of elements in the buckets will increase, causing performance to get worse. To +combat this, the containers increase the bucket count as elements are inserted. + +The standard gives you two methods to influence the bucket count. First, you can +specify the minimum number of buckets in the constructor, or later by calling +`rehash`. + +The other method is the `max_load_factor` member function. This lets you +/hint/ at the maximum load that the buckets should hold. +The 'load factor' is the average number of elements per bucket; +the container tries to keep this below the maximum load factor, which is +initially set to 1.0. +`max_load_factor` tells the container to change the maximum load factor, +using your supplied hint as a suggestion. + +TR1 doesn't actually require the container to pay much attention to this +value. The only time the load factor is required to be less than the maximum +is following a call to `rehash`. + +It is not specified anywhere how other member functions affect the bucket count. +But most implementations will invalidate the iterators whenever they change +the bucket count - which is only allowed when an +`insert` causes the load factor to be greater than or equal to the maximum. +However, it is possible to implement the containers such that the iterators are +never invalidated. + +(TODO: This might not be right. 
I'm not sure what is allowed for +std::unordered_set and std::unordered_map when insert is called with enough +elements to exceed the maximum, but the maximum isn't exceeded because +the elements are already in the container) + +This all sounds quite gloomy, but it's not that bad. Most implementations +will probably respect the maximum load factor hint. This implementation +certainly does. + +[table Methods for Controlling Bucket Size + [[Method] [Description]] + + [ + [``float load_factor() const``] + [The average number of elements per bucket.] + ] + [ + [``float max_load_factor() const``] + [Returns the current maximum load factor.] + ] + [ + [``float max_load_factor(float z)``] + [Changes the container's maximum load factor, using `z` as a hint.] + ] + [ + [``void rehash(size_type n)``] + [Changes the number of buckets so that there are at least `n` buckets, and + so that the load factor is less than the maximum load factor.] + ] + +] + +[h2 Rehash Techniques] + +If the container has a load factor much smaller than the maximum, `rehash` +might decrease the number of buckets, reducing the memory usage. This isn't +guaranteed by the standard, but this implementation will do it. + +When inserting many elements, it is a good idea to first call `rehash` to +make sure you have enough buckets. This will get the expensive rehashing out +of the way and let you store iterators, safe in the knowledge that they +won't be invalidated. If you are inserting `n` elements into container `x`, +you could first call: + + x.rehash((x.size() + n) / x.max_load_factor() + 1); + +If you want to stop the table from ever rehashing due to an insert, you can +set the maximum load factor to infinity (or perhaps a load factor that it'll +never reach - say `x.max_size()`). As you can only give a 'hint' for the maximum +load factor, this isn't guaranteed to work. But again, it'll work in this +implementation. + +If you do this and want to make the container rehash, `rehash` will still work. 
+But be careful that you only ever call it with a sufficient number of buckets +- otherwise it's very likely that the container will decrease the bucket +count to an overly small number. + +[endsect] diff --git a/doc/comparison.qbk b/doc/comparison.qbk index b60c5ada..b9cbc562 100644 --- a/doc/comparison.qbk +++ b/doc/comparison.qbk @@ -27,20 +27,25 @@ might output: two,2 one,1 three,3 + missing,0 -while if `std::set` was used it would be ordered lexicographically. +or the same elements in any other order. If `std::set` were used, it would always +be ordered lexicographically. The containers automatically grow as more elements are inserted, this can cause the order to change and iterators to be invalidated (unlike associative containers whose iterators are only invalidated when their elements are erased). -So containers containing identical elements aren't guaranteed to -contain them in the same order. For this reason equality and inequality -operators (which in the STL are defined in terms of sequence equality) -aren't available. +In the STL, the comparison operators for containers are defined in terms +of comparing the sequence of elements. As the elements' order is unpredictable, +this would be nonsensical, so the comparison operators aren't defined. -[section Equality Predicate and Hash Functions] + + +[endsect] + +[section Equality Predicates and Hash Functions] [/TODO: A better introduction to hash functions?] @@ -63,5 +68,3 @@ it. 
For example, if you wanted to use [endsect] - -[endsect] diff --git a/doc/diagrams/buckets.dia b/doc/diagrams/buckets.dia new file mode 100644 index 00000000..f5e76e69 Binary files /dev/null and b/doc/diagrams/buckets.dia differ diff --git a/doc/diagrams/buckets.png b/doc/diagrams/buckets.png new file mode 100644 index 00000000..6d8ecdab Binary files /dev/null and b/doc/diagrams/buckets.png differ diff --git a/doc/intro.qbk b/doc/intro.qbk index 8457b2ff..7addd14f 100644 --- a/doc/intro.qbk +++ b/doc/intro.qbk @@ -1,5 +1,7 @@ [def __tr1__ [@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1745.pdf C++ Standard Library Technical Report]] +[def __hash-table__ [@http://en.wikipedia.org/wiki/Hash_table + hash table]] [def __hash-function__ [@http://en.wikipedia.org/wiki/Hash_function hash function]] @@ -7,28 +9,69 @@ For accessing data based on keys, the C++ standard library offers `std::set`, `std::map`, `std::multiset` and `std::multimap`. These are generally -implemented using balanced binary trees so accessing data by key is -consistently of logarithmic complexity. Which is generally okay, but not great. +implemented using balanced binary trees so lookup time has +logarithmic complexity. That is generally okay, but in many cases a +__hash-table__ can perform better, as accessing data has constant complexity, +on average. The worst case complexity is linear, but that occurs rarely and, +with some care, can be avoided. -Also, they require their elements to be ordered, and to supply a 'less than' -comparison object. For some data types this is impractacle, It might be slow -to calculate, or even impossible. +Also, the existing containers require a 'less than' comparison object +to order their elements. For some data types this is impractical. +It might be slow to calculate, or even impossible. On the other hand, in a hash +table, the elements aren't ordered - but you need an equality function +and a hash function for the key. 
-So the __tr1__ provides unordered associative containers. These will store data -with no ordering, and typically allow for fast constant time access. Their -worst case complexity is linear, but this is generally rare, and with -care can be avoided. There are four containers to match the existing -associate containers: -[classref boost::unordered_set unordered_set], -[classref boost::unordered_map unordered_map], -[classref boost::unordered_multiset unordered_multiset] and -[classref boost::unordered_multimap unordered_multimap]. +So the __tr1__ provides the unordered associative containers, which are +implemented using hash tables. There are four containers to match the existing +associative containers. In the header <[headerref boost/unordered_set.hpp]>: -The fast lookup speeds are acheived using a __hash-function__. The basic idea -is that a function is called for the key value, which is used to index the -data. If this hash function is carefully chosen the different keys will usually -get different indicies, so there will only be a very small number of keys for -each index, so a key lookup won't have to look at many keys to find the correct -one. 
+ template < + class Key, + class Hash = boost::hash<Key>, + class Pred = std::equal_to<Key>, + class Alloc = std::allocator<Key> > + class ``[classref boost::unordered_set unordered_set]``; + + template< + class Key, + class Hash = boost::hash<Key>, + class Pred = std::equal_to<Key>, + class Alloc = std::allocator<Key> > + class ``[classref boost::unordered_multiset unordered_multiset]``; + +and in <[headerref boost/unordered_map.hpp]>: + + template < + class Key, class T, + class Hash = boost::hash<Key>, + class Pred = std::equal_to<Key>, + class Alloc = std::allocator<std::pair<Key const, T> > > + class ``[classref boost::unordered_map unordered_map]``; + + template< + class Key, class T, + class Hash = boost::hash<Key>, + class Pred = std::equal_to<Key>, + class Alloc = std::allocator<std::pair<Key const, T> > > + class ``[classref boost::unordered_multimap unordered_multimap]``; + +The containers are used in a similar manner to the normal associative +containers: + + #include <``[headerref boost/unordered_map.hpp]``> + #include <cassert> + + int main() + { + boost::unordered_map<std::string, int> x; + x["one"] = 1; + x["two"] = 2; + x["three"] = 3; + + assert(x["one"] == 1); + assert(x["missing"] == 0); + } + +But there are some major differences, which will be detailed later. [endsect] diff --git a/doc/ref.xml b/doc/ref.xml index eb6653ab..0a8ce52c 100644 --- a/doc/ref.xml +++ b/doc/ref.xml @@ -1,6 +1,4 @@ - - -
+ @@ -75,9 +73,7 @@ The elements are organized into buckets. Keys with the same hash code are stored in the same bucket. - The number of buckets is automatically increased whenever an insert will make the load factor greater than the maximum load factor. It can also change as result of calling rehash. - - When the number of buckets change: iterators are invalidated, the elements can change order, and move to different buckets, but pointers and references to elements remain valid. + The number of buckets can be automatically increased by a call to insert, or as the result of calling rehash. @@ -319,7 +315,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -360,7 +356,7 @@ The standard is fairly vague on the meaning of the hint. But the only practical way to use it, and the only way that Boost.Unordered supports is to point to an existing element with the same value. - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -386,7 +382,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -603,6 +599,11 @@ n < bucket_count() + + The number of elements in bucket + n. + + @@ -692,7 +693,7 @@ float - Changes the container's maximum load factor,using + Changes the container's maximum load factor, using z as a hint. @@ -704,13 +705,13 @@ void Changes the number of buckets so that there at least - n buckets, and so that the load factor is less thanthe maximum load factor. 
+ n buckets, and so that the load factor is less than the maximum load factor. Invalidates iterators, and changes the order of elements - The function has no effect if an exception is throw, unless it is thrown by the container’s hash function or comparison function. + The function has no effect if an exception is thrown, unless it is thrown by the container’s hash function or comparison function. @@ -813,9 +814,7 @@ The elements are organized into buckets. Keys with the same hash code are stored in the same bucket and elements with equivalent keys are stored next to each other. - The number of buckets is automatically increased whenever an insert will make the load factor greater than the maximum load factor. It can also change as result of calling rehash. - - When the number of buckets change: iterators are invalidated, the elements can change order, and move to different buckets, but pointers and references to elements remain valid. + The number of buckets can be automatically increased by a call to insert, or as the result of calling rehash. @@ -1055,7 +1054,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -1096,7 +1095,7 @@ The standard is fairly vague on the meaning of the hint. But the only practical way to use it, and the only way that Boost.Unordered supports is to point to an existing element with the same value. - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -1122,7 +1121,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. 
+ Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -1339,6 +1338,11 @@ n < bucket_count() + + The number of elements in bucket + n. + + @@ -1428,7 +1432,7 @@ float - Changes the container's maximum load factor,using + Changes the container's maximum load factor, using z as a hint. @@ -1440,13 +1444,13 @@ void Changes the number of buckets so that there at least - n buckets, and so that the load factor is less thanthe maximum load factor. + n buckets, and so that the load factor is less than the maximum load factor. Invalidates iterators, and changes the order of elements - The function has no effect if an exception is throw, unless it is thrown by the container’s hash function or comparison function. + The function has no effect if an exception is thrown, unless it is thrown by the container’s hash function or comparison function. @@ -1566,9 +1570,7 @@ The elements are organized into buckets. Keys with the same hash code are stored in the same bucket. - The number of buckets is automatically increased whenever an insert will make the load factor greater than the maximum load factor. It can also change as result of calling rehash. - - When the number of buckets change: iterators are invalidated, the elements can change order, and move to different buckets, but pointers and references to elements remain valid. + The number of buckets can be automatically increased by a call to insert, or as the result of calling rehash. @@ -1813,7 +1815,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -1854,7 +1856,7 @@ The standard is fairly vague on the meaning of the hint. But the only practical way to use it, and the only way that Boost.Unordered supports is to point to an existing element with the same key. 
- Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -1880,7 +1882,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -2092,7 +2094,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -2122,6 +2124,11 @@ n < bucket_count() + + The number of elements in bucket + n. + + @@ -2211,7 +2218,7 @@ float - Changes the container's maximum load factor,using + Changes the container's maximum load factor, using z as a hint. @@ -2223,13 +2230,13 @@ void Changes the number of buckets so that there at least - n buckets, and so that the load factor is less thanthe maximum load factor. + n buckets, and so that the load factor is less than the maximum load factor. Invalidates iterators, and changes the order of elements - The function has no effect if an exception is throw, unless it is thrown by the container’s hash function or comparison function. + The function has no effect if an exception is thrown, unless it is thrown by the container’s hash function or comparison function. @@ -2343,9 +2350,7 @@ The elements are organized into buckets. Keys with the same hash code are stored in the same bucket and elements with equivalent keys are stored next to each other. - The number of buckets is automatically increased whenever an insert will make the load factor greater than the maximum load factor. It can also change as result of calling rehash. 
- - When the number of buckets change: iterators are invalidated, the elements can change order, and move to different buckets, but pointers and references to elements remain valid. + The number of buckets can be automatically increased by a call to insert, or as the result of calling rehash. @@ -2588,7 +2593,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -2629,7 +2634,7 @@ The standard is fairly vague on the meaning of the hint. But the only practical way to use it, and the only way that Boost.Unordered supports is to point to an existing element with the same key. - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -2655,7 +2660,7 @@ - Will only rehash if the insert causes the load factor to be greater to or equal to the maximum load factor. + Can invalidate iterators, but only if the insert causes the load factor to be greater than or equal to the maximum load factor. @@ -2872,6 +2877,11 @@ n < bucket_count() + + The number of elements in bucket + n. + + @@ -2961,7 +2971,7 @@ float - Changes the container's maximum load factor,using + Changes the container's maximum load factor, using z as a hint. @@ -2973,13 +2983,13 @@ void Changes the number of buckets so that there at least - n buckets, and so that the load factor is less thanthe maximum load factor. + n buckets, and so that the load factor is less than the maximum load factor. Invalidates iterators, and changes the order of elements - The function has no effect if an exception is throw, unless it is thrown by the container’s hash function or comparison function. 
+ The function has no effect if an exception is thrown, unless it is thrown by the container’s hash function or comparison function. @@ -3018,5 +3028,4 @@ -
-
+ diff --git a/doc/unordered.qbk b/doc/unordered.qbk index fb88e703..d0e988c6 100644 --- a/doc/unordered.qbk +++ b/doc/unordered.qbk @@ -18,6 +18,7 @@ and, worst of all, incorrect. Don't take anything in it seriously. [endsect] [include:unordered intro.qbk] +[include:unordered buckets.qbk] [include:unordered comparison.qbk] [include:unordered rationale.qbk]