From 17b4216b49680f9eb2627f9a1c052631fba989be Mon Sep 17 00:00:00 2001
From: Daniel James
Date: Thu, 31 May 2007 22:33:39 +0000
Subject: [PATCH] Trying to make the unordered documentation a little better.

[SVN r4407]
---
 doc/buckets.qbk | 148 ++++++++++++++++++++++++++++++++++++++++++++++++
 doc/intro.qbk   | 113 ++++++++++++++++++++++++++++++++++++
 2 files changed, 261 insertions(+)
 create mode 100644 doc/buckets.qbk
 create mode 100644 doc/intro.qbk

diff --git a/doc/buckets.qbk b/doc/buckets.qbk
new file mode 100644
index 00000000..9227acdb
--- /dev/null
+++ b/doc/buckets.qbk
@@ -0,0 +1,148 @@
+[/ Copyright 2006-2007 Daniel James.
+ / Distributed under the Boost Software License, Version 1.0. (See accompanying
+ / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]
+
+[section:buckets The Data Structure]
+
+The containers are made up of a number of 'buckets', each of which can contain
+any number of elements. For example, the following diagram shows an [classref
+boost::unordered_set unordered_set] with 7 buckets containing 5 elements, `A`,
+`B`, `C`, `D` and `E` (this is just for illustration; in practice containers
+will have many more buckets).
+
+[$../../libs/unordered/doc/diagrams/buckets.png]
+
+To decide which bucket to place an element in, the container applies the hash
+function, `Hash`, to the element's key (for `unordered_set` and
+`unordered_multiset` the key is the whole element, but it is referred to as the
+key so that the same terminology can be used for sets and maps). This returns a
+value of type `std::size_t`, which has a much greater range of values than the
+number of buckets, so the container applies another transformation to that
+value to choose the bucket in which to place the element.
+
+If the container later needs to find an element, it just applies the same
+process to the element's key to discover which bucket it is in. If the hash
+function has worked well the elements will be evenly distributed amongst the
+buckets, so only a small number of elements need to be examined.
+
+You can see in the diagram that `A` & `D` have been placed in the same bucket.
+This means that when looking for these elements, or another element that would
+be placed in the same bucket, up to 2 comparisons have to be made, making
+searching slower. This is known as a collision. To keep things fast we try to
+keep collisions to a minimum.
+
+[table Methods for Accessing Buckets
+    [[Method] [Description]]
+
+    [
+        [``size_type bucket_count() const``]
+        [The number of buckets.]
+    ]
+    [
+        [``size_type max_bucket_count() const``]
+        [An upper bound on the number of buckets.]
+    ]
+    [
+        [``size_type bucket_size(size_type n) const``]
+        [The number of elements in bucket `n`.]
+    ]
+    [
+        [``size_type bucket(key_type const& k) const``]
+        [Returns the index of the bucket which would contain `k`.]
+    ]
+    [
+        [``
+local_iterator begin(size_type n);
+local_iterator end(size_type n);
+const_local_iterator begin(size_type n) const;
+const_local_iterator end(size_type n) const;
+        ``]
+        [Return begin and end iterators for bucket `n`.]
+    ]
+]
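+
+For example, here is a minimal sketch that uses the members listed above to
+examine how elements are spread over the buckets of an [classref
+boost::unordered_set unordered_set] (the element count is arbitrary):
+
+    #include <boost/unordered_set.hpp>
+    #include <iostream>
+
+    int main()
+    {
+        boost::unordered_set<int> x;
+        for(int i = 0; i < 100; ++i)
+            x.insert(i);
+
+        // The bucket that the key 0 would be placed in.
+        std::cout<<"0 belongs in bucket "<<x.bucket(0)<<"\n";
+
+        // How many elements each bucket currently holds.
+        for(boost::unordered_set<int>::size_type b = 0;
+                b < x.bucket_count(); ++b)
+            std::cout<<"bucket "<<b<<": "<<x.bucket_size(b)<<" elements\n";
+    }
+
+Both the number of buckets and the exact distribution of elements will vary
+from implementation to implementation.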
+
+[h2 Controlling the number of buckets]
+
+As more elements are added to an unordered associative container, the number
+of elements in each bucket will increase, causing performance to degrade. To
+combat this the containers increase the bucket count as elements are inserted.
+
+The standard gives you two ways to influence the bucket count. First, you can
+specify the minimum number of buckets when constructing the container, and
+later by calling `rehash`.
+
+The other way is the `max_load_factor` member function. The 'load factor' is
+the average number of elements per bucket, and `max_load_factor` can be used
+to give a /hint/ of a value that the load factor should be kept below. The
+draft standard doesn't actually require the container to pay much attention
+to this value. The only time the load factor is /required/ to be less than the
+maximum is following a call to `rehash`. But most implementations will probably
+try to keep the number of elements below the maximum load factor, and will set
+the maximum load factor to your hint, or something near it - unless your hint
+is unreasonably small.
+
+It is not specified anywhere how member functions other than `rehash` affect
+the bucket count, although `insert` is only allowed to invalidate iterators
+when the insertion causes the load factor to reach the maximum, which
+typically means that `insert` will only change the number of buckets when an
+insertion causes this to happen.
+
+In a similar manner to using `reserve` for `vector`s, it can be a good idea
+to call `rehash` before inserting a large number of elements. This gets the
+expensive rehashing out of the way and lets you store iterators, safe in
+the knowledge that they won't be invalidated. If you are inserting `n`
+elements into container `x`, you could first call:
+
+    x.rehash((x.size() + n) / x.max_load_factor() + 1);
+
+[blurb Note: `rehash`'s argument is the number of buckets, not the number of
+elements, which is why the new size is divided by the maximum load factor. The
+`+ 1` is required because the container is allowed to resize when the load
+factor is equal to the maximum load factor.]
+
+[table Methods for Controlling Bucket Size
+    [[Method] [Description]]
+
+    [
+        [``float load_factor() const``]
+        [The average number of elements per bucket.]
+    ]
+    [
+        [``float max_load_factor() const``]
+        [Returns the current maximum load factor.]
+    ]
+    [
+        [``float max_load_factor(float z)``]
+        [Changes the container's maximum load factor, using `z` as a hint.]
+    ]
+    [
+        [``void rehash(size_type n)``]
+        [Changes the number of buckets so that there are at least `n` buckets,
+        and so that the load factor is less than the maximum load factor.]
+    ]
+
+]
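+
+Putting this together, a rough sketch of reserving buckets before a bulk
+insert might look like the following (the element count of 10,000 is
+arbitrary):
+
+    #include <boost/unordered_set.hpp>
+    #include <iostream>
+
+    int main()
+    {
+        boost::unordered_set<int> x;
+        int const n = 10000;
+
+        // Reserve enough buckets for n more elements, using the formula
+        // described above, so that the insertions below shouldn't trigger
+        // a rehash.
+        x.rehash((x.size() + n) / x.max_load_factor() + 1);
+
+        for(int i = 0; i < n; ++i)
+            x.insert(i);
+
+        std::cout<<x.bucket_count()<<" buckets, load factor "
+            <<x.load_factor()<<", maximum load factor "
+            <<x.max_load_factor()<<"\n";
+    }
+
+If the hint is respected, no rehash occurs during the loop, so iterators
+obtained before the insertions remain valid.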
+
+[/ I'm not at all happy with this section. So I've commented it out.]
+
+[/ h2 Rehash Techniques]
+
+[/If the container has a load factor much smaller than the maximum, `rehash`
+might decrease the number of buckets, reducing the memory usage. This isn't
+guaranteed by the standard but this implementation will do it.
+
+If you want to stop the table from ever rehashing due to an insert, you can
+set the maximum load factor to infinity (or perhaps a load factor that it'll
+never reach - say `x.max_size()`). As you can only give a 'hint' for the
+maximum load factor, this isn't guaranteed to work. But again, it'll work in
+this implementation. (TODO: If an unordered container with an infinite maximum
+load factor is copied, bad things could happen. So maybe this advice should be
+removed. Or maybe the implementation should cope with that).
+
+If you do this and want to make the container rehash, `rehash` will still work.
+But be careful that you only ever call it with a sufficient number of buckets
+- otherwise it's very likely that the container will decrease the bucket
+count to an overly small amount.]
+
+[endsect]
diff --git a/doc/intro.qbk b/doc/intro.qbk
new file mode 100644
index 00000000..de4c36ec
--- /dev/null
+++ b/doc/intro.qbk
@@ -0,0 +1,113 @@
+[/ Copyright 2006-2007 Daniel James.
+ / Distributed under the Boost Software License, Version 1.0. (See accompanying
+ / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]
+
+[def __tr1__
+    [@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2009.pdf
+    C++ Standard Library Technical Report]]
+[def __boost-tr1__
+    [@http://www.boost.org/doc/html/boost_tr1.html
+    Boost.TR1]]
+[def __draft__
+    [@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2009.pdf
+    Working Draft of the C++ Standard]]
+[def __hash-table__ [@http://en.wikipedia.org/wiki/Hash_table
+    hash table]]
+[def __hash-function__ [@http://en.wikipedia.org/wiki/Hash_function
+    hash function]]
+
+[section:intro Introduction]
+
+For accessing data based on key lookup, the C++ standard library offers
+`std::set`, `std::map`, `std::multiset` and `std::multimap`. These are
+generally implemented using balanced binary trees, so that lookup time has
+logarithmic complexity. That is generally okay, but in many cases a
+__hash-table__ can perform better, as accessing data has constant complexity,
+on average. The worst case complexity is linear, but that occurs rarely and,
+with some care, can be avoided.
+
+Also, the existing containers require a 'less than' comparison object to order
+their elements. For some data types this is impossible to implement or isn't
+practical. In contrast, a hash table only needs an equality function and a
+hash function for the key.
+
+So the __tr1__ introduced the unordered associative containers, which are
+implemented using hash tables, and they have now been added to the __draft__.
+
+This library supplies a standards-compliant implementation which is proposed
+for addition to Boost. If it is accepted, the containers should also be added
+to __boost-tr1__.
+
+`unordered_set` and `unordered_multiset` are defined in the header
+<[headerref boost/unordered_set.hpp]>
+
+    namespace boost {
+        template <
+            class Key,
+            class Hash = boost::hash<Key>,
+            class Pred = std::equal_to<Key>,
+            class Alloc = std::allocator<Key> >
+        class ``[classref boost::unordered_set unordered_set]``;
+
+        template <
+            class Key,
+            class Hash = boost::hash<Key>,
+            class Pred = std::equal_to<Key>,
+            class Alloc = std::allocator<Key> >
+        class ``[classref boost::unordered_multiset unordered_multiset]``;
+    }
+
+`unordered_map` and `unordered_multimap` are defined in the header
+<[headerref boost/unordered_map.hpp]>
+
+    namespace boost {
+        template <
+            class Key, class T,
+            class Hash = boost::hash<Key>,
+            class Pred = std::equal_to<Key>,
+            class Alloc = std::allocator<std::pair<const Key, T> > >
+        class ``[classref boost::unordered_map unordered_map]``;
+
+        template <
+            class Key, class T,
+            class Hash = boost::hash<Key>,
+            class Pred = std::equal_to<Key>,
+            class Alloc = std::allocator<std::pair<const Key, T> > >
+        class ``[classref boost::unordered_multimap unordered_multimap]``;
+    }
+
+If you are using Boost.TR1, these classes will be included by
+`<unordered_set>` and `<unordered_map>`, with the classes placed in the
+`std::tr1` namespace.
+
+The containers are used in a similar manner to the normal associative
+containers:
+
+    #include <``[headerref boost/unordered_map.hpp]``>
+    #include <cassert>
+
+    int main()
+    {
+        boost::unordered_map<std::string, int> x;
+        x["one"] = 1;
+        x["two"] = 2;
+        x["three"] = 3;
+
+        assert(x["one"] == 1);
+        assert(x["missing"] == 0);
+    }
+
+But since the elements aren't ordered, the output of:
+
+    typedef boost::unordered_map<std::string, int> map;
+
+    BOOST_FOREACH(map::value_type i, x) {
+        std::cout<<i.first<<","<<i.second<<"\n";
+    }
+
+can be in any order.
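+
+For comparison, here is a small sketch (the `ordered` typedef is introduced
+just for this example) that stores the same data in a `std::map`; because
+`std::map` keeps its elements sorted by key, the equivalent loop always
+prints them in the same, lexicographic, order:
+
+    #include <boost/foreach.hpp>
+    #include <iostream>
+    #include <map>
+    #include <string>
+
+    int main()
+    {
+        typedef std::map<std::string, int> ordered;
+
+        ordered x;
+        x["one"] = 1;
+        x["two"] = 2;
+        x["three"] = 3;
+
+        // Always prints one,1 then three,3 then two,2 (keys in sorted order).
+        BOOST_FOREACH(ordered::value_type i, x) {
+            std::cout<<i.first<<","<<i.second<<"\n";
+        }
+    }
+
+The unordered containers make no such guarantee, so the order in which the
+elements appear can vary from implementation to implementation.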