mirror of
https://github.com/boostorg/unordered.git
synced 2025-07-31 11:57:15 +02:00
Trying to make the unordered documentation a little better.
[SVN r4407]
doc/buckets.qbk | 148 lines (new file)
@@ -0,0 +1,148 @@
[/ Copyright 2006-2007 Daniel James.
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]

[section:buckets The Data Structure]

The containers are made up of a number of 'buckets', each of which can contain
any number of elements. For example, the following diagram shows an [classref
boost::unordered_set unordered_set] with 7 buckets containing 5 elements, `A`,
`B`, `C`, `D` and `E` (this is just for illustration; in practice containers
will have more buckets).

[$../../libs/unordered/doc/diagrams/buckets.png]

To decide which bucket to place an element in, the container applies the hash
function, `Hash`, to the element's key (for `unordered_set` and
`unordered_multiset` the key is the whole element, but it is referred to as the
key so that the same terminology can be used for sets and maps). This returns a
value of type `std::size_t`, which has a much greater range of values than the
number of buckets, so the container applies another transformation to that
value to choose a bucket to place the element in.
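
The exact transformation is left to the implementation. As a rough sketch of
the idea only (not necessarily the mapping this library uses), the reduction
could be as simple as a modulo by the bucket count:

    #include <boost/functional/hash.hpp>
    #include <cstddef>
    #include <string>

    // Illustrative only: one possible way of reducing a hash value to a bucket
    // index. The transformation actually used is implementation-defined.
    std::size_t pick_bucket(std::string const& key, std::size_t bucket_count)
    {
        boost::hash<std::string> hasher;
        std::size_t hash_value = hasher(key); // anywhere in the range of std::size_t
        return hash_value % bucket_count;     // reduced to [0, bucket_count)
    }
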
To find an element later, the container just applies the same process to the
element's key to discover which bucket it is in. If the hash function has
worked well the elements will be evenly distributed amongst the buckets, so
only a small number of elements needs to be examined.

You can see in the diagram that `A` & `D` have been placed in the same bucket.
This means that when looking for one of these elements, or for another element
that would be placed in the same bucket, up to 2 comparisons have to be made,
making searching slower. This is known as a collision. To keep things fast we
try to keep collisions to a minimum.

[table Methods for Accessing Buckets
    [[Method] [Description]]

    [
        [``size_type bucket_count() const``]
        [The number of buckets.]
    ]
    [
        [``size_type max_bucket_count() const``]
        [An upper bound on the number of buckets.]
    ]
    [
        [``size_type bucket_size(size_type n) const``]
        [The number of elements in bucket `n`.]
    ]
    [
        [``size_type bucket(key_type const& k) const``]
        [The index of the bucket which would contain `k`.]
    ]
    [
        [``
local_iterator begin(size_type n);
local_iterator end(size_type n);
const_local_iterator begin(size_type n) const;
const_local_iterator end(size_type n) const;
        ``]
        [Return begin and end iterators for bucket `n`.]
    ]
]
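
For example, the bucket interface can be used to see how the elements of a
container have been distributed (a small sketch; the bucket count, indices and
sizes you see will vary between implementations):

    #include <boost/unordered_set.hpp>
    #include <cstddef>
    #include <iostream>
    #include <string>

    int main()
    {
        boost::unordered_set<std::string> x;
        x.insert("A"); x.insert("B"); x.insert("C");
        x.insert("D"); x.insert("E");

        std::cout << "bucket count: " << x.bucket_count() << "\n";

        // Which bucket would "A" be placed in, and how many elements share it?
        std::size_t n = x.bucket("A");
        std::cout << "bucket " << n << " holds " << x.bucket_size(n)
                  << " element(s):\n";

        // Iterate over just that one bucket using the local iterators.
        boost::unordered_set<std::string>::const_local_iterator it, end;
        for (it = x.begin(n), end = x.end(n); it != end; ++it)
            std::cout << "  " << *it << "\n";
    }
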
[h2 Controlling the number of buckets]

As more elements are added to an unordered associative container, the number
of elements in the buckets will increase, causing performance to degrade. To
combat this, the containers increase the bucket count as elements are inserted.

The standard gives you two ways to influence the bucket count: you can specify
the minimum number of buckets in the constructor, and later on you can request
a change by calling `rehash`.
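
For example (a sketch; the container is free to round the requested count up
to a value it prefers):

    // Ask for at least 1024 buckets when the container is constructed.
    boost::unordered_set<int> x(1024);

    // ... later, request a larger number of buckets explicitly:
    x.rehash(4096);
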
The other method is the `max_load_factor` member function. The 'load factor'
is the average number of elements per bucket, and `max_load_factor` can be used
to give a /hint/ of a value that the load factor should be kept below. The
draft standard doesn't actually require the container to pay much attention
to this value. The only time the load factor is /required/ to be less than the
maximum is following a call to `rehash`. But most implementations will probably
try to keep the load factor below the maximum, and set the maximum load factor
to the same as, or close to, your hint - unless your hint is unreasonably small.

It is not specified anywhere how member functions other than `rehash` affect
the bucket count, although `insert` is only allowed to invalidate iterators
when the insertion causes the load factor to reach the maximum. In practice
this typically means that `insert` will only change the number of buckets when
an insertion pushes the load factor up to the maximum.

In a similar manner to using `reserve` for `vector`s, it can be a good idea
to call `rehash` before inserting a large number of elements. This gets
the expensive rehashing out of the way and lets you store iterators, safe in
the knowledge that they won't be invalidated. If you are inserting `n`
elements into container `x`, you could first call:

    x.rehash((x.size() + n) / x.max_load_factor() + 1);

[blurb Note: `rehash`'s argument is the number of buckets, not the number of
elements, which is why the new size is divided by the maximum load factor. The
`+ 1` is required because the container is allowed to resize when the load
factor is equal to the maximum load factor.]
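
Putting that together, a sketch of the pattern (a fragment; `load_words` is a
hypothetical function returning the new elements):

    std::vector<std::string> new_elements = load_words(); // hypothetical data source
    boost::unordered_set<std::string> x;

    // Make room for the current elements plus the new ones up front, so the
    // insertions below shouldn't trigger any further rehashing.
    x.rehash((x.size() + new_elements.size()) / x.max_load_factor() + 1);

    for (std::size_t i = 0; i < new_elements.size(); ++i)
        x.insert(new_elements[i]);
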
[table Methods for Controlling Bucket Size
    [[Method] [Description]]

    [
        [``float load_factor() const``]
        [The average number of elements per bucket.]
    ]
    [
        [``float max_load_factor() const``]
        [Returns the current maximum load factor.]
    ]
    [
        [``float max_load_factor(float z)``]
        [Changes the container's maximum load factor, using `z` as a hint.]
    ]
    [
        [``void rehash(size_type n)``]
        [Changes the number of buckets so that there are at least `n` buckets,
        and so that the load factor is less than the maximum load factor.]
    ]
]
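
For example, to trade memory for fewer collisions you could lower the maximum
load factor before filling the container (remembering that the value you pass
is only a hint):

    boost::unordered_map<std::string, int> x;
    x.max_load_factor(0.5f); // hint: aim for roughly two buckets per element

    // ... insert elements ...

    std::cout << "load factor:     " << x.load_factor() << "\n";
    std::cout << "max load factor: " << x.max_load_factor() << "\n";
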
[/ I'm not at all happy with this section. So I've commented it out.]

[/ h2 Rehash Techniques]

[/If the container has a load factor much smaller than the maximum, `rehash`
might decrease the number of buckets, reducing the memory usage. This isn't
guaranteed by the standard but this implementation will do it.

If you want to stop the table from ever rehashing due to an insert, you can
set the maximum load factor to infinity (or perhaps a load factor that it'll
never reach - say `x.max_size()`). As you can only give a 'hint' for the
maximum load factor, this isn't guaranteed to work. But again, it'll work in
this implementation. (TODO: If an unordered container with an infinite load
factor is copied, bad things could happen. So maybe this advice should be
removed. Or maybe the implementation should cope with that.)

If you do this and want to make the container rehash, `rehash` will still work.
But be careful that you only ever call it with a sufficient number of buckets
- otherwise it's very likely that the container will decrease the bucket
count to an overly small amount.]

[endsect]
doc/intro.qbk | 113 lines (new file)
@@ -0,0 +1,113 @@
[/ Copyright 2006-2007 Daniel James.
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]

[def __tr1__
    [@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2009.pdf
    C++ Standard Library Technical Report]]
[def __boost-tr1__
    [@http://www.boost.org/doc/html/boost_tr1.html
    Boost.TR1]]
[def __draft__
    [@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2009.pdf
    Working Draft of the C++ Standard]]
[def __hash-table__ [@http://en.wikipedia.org/wiki/Hash_table
    hash table]]
[def __hash-function__ [@http://en.wikipedia.org/wiki/Hash_function
    hash function]]

[section:intro Introduction]

For accessing data based on key lookup, the C++ standard library offers `std::set`,
`std::map`, `std::multiset` and `std::multimap`. These are generally
implemented using balanced binary trees, so that lookup time has
logarithmic complexity. That is generally okay, but in many cases a
__hash-table__ can perform better, as accessing data has constant complexity
on average. The worst case complexity is linear, but that occurs rarely and,
with some care, can be avoided.

Also, the existing containers require a 'less than' comparison object
to order their elements. For some data types this is impossible to implement
or isn't practical. In contrast, a hash table only needs an equality function
and a hash function for the key.
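
For example, a type with no natural ordering can still be used as a key, as
long as it can be compared for equality and hashed (a sketch using a made-up
`point` type; the free function `hash_value` is picked up by `boost::hash` via
argument-dependent lookup):

    #include <boost/functional/hash.hpp>
    #include <cstddef>

    struct point
    {
        int x, y;
    };

    bool operator==(point const& a, point const& b)
    {
        return a.x == b.x && a.y == b.y;
    }

    std::size_t hash_value(point const& p)
    {
        std::size_t seed = 0;
        boost::hash_combine(seed, p.x);
        boost::hash_combine(seed, p.y);
        return seed;
    }

    // point has no 'less than', but it can still be stored in the unordered
    // containers described below.
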
So the __tr1__ introduced the unordered associative containers, which are
implemented using hash tables, and they have now been added to the __draft__.

This library supplies a standards-compliant implementation of those containers,
and is proposed for addition to Boost. If accepted, they should also be added
to __boost-tr1__.

`unordered_set` and `unordered_multiset` are defined in the header
<[headerref boost/unordered_set.hpp]>

    namespace boost {
        template <
            class Key,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_set unordered_set]``;

        template <
            class Key,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_multiset unordered_multiset]``;
    }
`unordered_map` and `unordered_multimap` are defined in the header
<[headerref boost/unordered_map.hpp]>

    namespace boost {
        template <
            class Key, class T,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_map unordered_map]``;

        template <
            class Key, class T,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_multimap unordered_multimap]``;
    }

If you are using Boost.TR1, these classes will be available from the standard
headers `<unordered_set>` and `<unordered_map>`, in the `std::tr1` namespace.
The containers are used in a similar manner to the normal associative
containers:

    #include <``[headerref boost/unordered_map.hpp]``>
    #include <cassert>
    #include <string>

    int main()
    {
        boost::unordered_map<std::string, int> x;
        x["one"] = 1;
        x["two"] = 2;
        x["three"] = 3;

        assert(x["one"] == 1);
        assert(x["missing"] == 0);
    }

But since the elements aren't ordered, the output of:

    typedef boost::unordered_map<std::string, int> map;
    BOOST_FOREACH(map::value_type i, x) {
        std::cout << i.first << "," << i.second << "\n";
    }

can be in any order. For example, it might be:

    two,2
    one,1
    three,3
    missing,0
There are other differences, which will be detailed later.

[endsect]