Trying to make the unordered documentation a little better.

[SVN r4407]
Daniel James
2007-05-31 22:33:39 +00:00
parent ddb67a849d
commit 17b4216b49
2 changed files with 261 additions and 0 deletions

doc/buckets.qbk

@@ -0,0 +1,148 @@
[/ Copyright 2006-2007 Daniel James.
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]
[section:buckets The Data Structure]
The containers are made up of a number of 'buckets', each of which can contain
any number of elements. For example, the following diagram shows an [classref
boost::unordered_set unordered_set] with 7 buckets containing 5 elements, `A`,
`B`, `C`, `D` and `E` (this is just for illustration; in practice containers
will have more buckets).
[$../../libs/unordered/doc/diagrams/buckets.png]
In order to decide which bucket to place an element in, the container applies
the hash function, `Hash`, to the element's key (for `unordered_set` and
`unordered_multiset` the key is the whole element, but it is referred to as the
key so that the same terminology can be used for sets and maps). This returns a
value of type `std::size_t`. `std::size_t` has a much greater range of values
than the number of buckets, so the container applies another transformation to
that value to choose a bucket to place the element in.
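As a rough sketch, choosing a bucket looks something like this (the exact
transformation from hash value to bucket index is implementation-defined;
taking the value modulo the bucket count is just the simplest possibility):

    std::size_t hash = hash_function(key);   // apply Hash to the key
    std::size_t index = hash % bucket_count; // reduce to a bucket index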
When the container later needs to find an element, it just applies the same
process to the element's key to discover which bucket it is in. If the hash
function has worked well, the elements will be evenly distributed amongst the
buckets, so only a small number of elements will need to be examined.
You can see in the diagram that `A` and `D` have been placed in the same
bucket. This means that when looking for one of these elements, or another
element that would be placed in the same bucket, up to 2 comparisons have to
be made, making searching slower. This is known as a collision. To keep things
fast we try to keep collisions to a minimum.
[table Methods for Accessing Buckets
[[Method] [Description]]
[
[``size_type bucket_count() const``]
[The number of buckets.]
]
[
[``size_type max_bucket_count() const``]
[An upper bound on the number of buckets.]
]
[
[``size_type bucket_size(size_type n) const``]
[The number of elements in bucket `n`.]
]
[
[``size_type bucket(key_type const& k) const``]
[Returns the index of the bucket which would contain `k`.]
]
[
[``
local_iterator begin(size_type n);
local_iterator end(size_type n);
const_local_iterator begin(size_type n) const;
const_local_iterator end(size_type n) const;
``]
[Return begin and end iterators for bucket `n`.]
]
]
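For example, the bucket interface above can be used to examine how elements
have been distributed (a sketch, assuming an [classref boost::unordered_set
unordered_set] of strings named `x` and the usual standard headers):

    std::size_t b = x.bucket("A"); // the bucket that would contain "A"
    std::cout << x.bucket_count() << " buckets\n";
    std::cout << x.bucket_size(b) << " elements in \"A\"'s bucket\n";

    // Iterate over just that one bucket:
    for (boost::unordered_set<std::string>::const_local_iterator
            it = x.begin(b), end = x.end(b); it != end; ++it)
        std::cout << *it << "\n";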
[h2 Controlling the number of buckets]
As more elements are added to an unordered associative container, the number
of elements in each bucket will increase, causing performance to degrade. To
combat this the containers increase the bucket count as elements are inserted.
The standard gives you two ways to influence the bucket count: you can specify
the minimum number of buckets in the constructor, and you can increase it
later by calling `rehash`.
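For example (both the constructor argument and `rehash` specify a /minimum/
bucket count - the container is free to use more buckets than requested):

    boost::unordered_set<int> x(1024); // start with at least 1024 buckets
    // ...
    x.rehash(4096);                    // later, increase to at least 4096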
The other method of control is the `max_load_factor` member function. The
'load factor' is the average number of elements per bucket, and
`max_load_factor` can be used to give a /hint/ of a value that the load factor
should be kept below. The draft standard doesn't actually require the
container to pay much attention to this value. The only time the load factor
is /required/ to be less than the maximum is following a call to `rehash`. But
most implementations will probably try to keep the load factor below the
maximum, and will set the maximum load factor to the same as, or close to,
your hint - unless your hint is unreasonably small.
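A sketch of how this might be used (the initial maximum load factor is 1.0):

    boost::unordered_set<int> x;
    std::cout << x.max_load_factor() << "\n"; // typically prints 1
    x.max_load_factor(2.0f); // hint: allow an average of two elements per bucket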
It is not specified anywhere how member functions other than `rehash` affect
the bucket count, although `insert` is only allowed to invalidate iterators
when the insertion causes the load factor to reach the maximum. In practice
this means that `insert` will typically only change the number of buckets when
an insertion causes the load factor to reach the maximum.
In a similar manner to using `reserve` for `vector`s, it can be a good idea
to call `rehash` before inserting a large number of elements. This gets the
expensive rehashing out of the way and lets you store iterators, safe in the
knowledge that they won't be invalidated. If you are inserting `n` elements
into container `x`, you could first call:

    x.rehash((x.size() + n) / x.max_load_factor() + 1);
[blurb Note: `rehash`'s argument is the number of buckets, not the number of
elements, which is why the new size is divided by the maximum load factor. The
`+ 1` is required because the container is allowed to resize when the load
factor is equal to the maximum load factor.]
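Putting this together (a sketch, where `values` is a hypothetical
`std::vector<int>` holding the `n` elements to be inserted):

    boost::unordered_set<int> x;
    x.rehash((x.size() + values.size()) / x.max_load_factor() + 1);

    // No rehash can occur during this insertion, so iterators into x
    // remain valid:
    x.insert(values.begin(), values.end());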
[table Methods for Controlling Bucket Size
[[Method] [Description]]
[
[``float load_factor() const``]
[The average number of elements per bucket.]
]
[
[``float max_load_factor() const``]
[Returns the current maximum load factor.]
]
[
[``void max_load_factor(float z)``]
[Changes the container's maximum load factor, using `z` as a hint.]
]
[
[``void rehash(size_type n)``]
[Changes the number of buckets so that there are at least `n` buckets, and
so that the load factor is less than the maximum load factor.]
]
]
[/ I'm not at all happy with this section. So I've commented it out.]
[/ h2 Rehash Techniques]
[/If the container has a load factor much smaller than the maximum, `rehash`
might decrease the number of buckets, reducing the memory usage. This isn't
guaranteed by the standard but this implementation will do it.
If you want to stop the table from ever rehashing due to an insert, you can
set the maximum load factor to infinity (or perhaps a load factor that it'll
never reach - say `x.max_size()`). As you can only give a 'hint' for the maximum
load factor, this isn't guaranteed to work. But again, it'll work in this
implementation. (TODO: If an unordered container with infinite load factor
is copied, bad things could happen. So maybe this advice should be removed. Or
maybe the implementation should cope with that).
If you do this and want to make the container rehash, `rehash` will still work.
But be careful that you only ever call it with a sufficient number of buckets
- otherwise it's very likely that the container will decrease the bucket
count to an overly small amount.]
[endsect]

doc/intro.qbk

@@ -0,0 +1,113 @@
[/ Copyright 2006-2007 Daniel James.
/ Distributed under the Boost Software License, Version 1.0. (See accompanying
/ file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) ]
[def __tr1__
[@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2009.pdf
C++ Standard Library Technical Report]]
[def __boost-tr1__
[@http://www.boost.org/doc/html/boost_tr1.html
Boost.TR1]]
[def __draft__
[@http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2009.pdf
Working Draft of the C++ Standard]]
[def __hash-table__ [@http://en.wikipedia.org/wiki/Hash_table
hash table]]
[def __hash-function__ [@http://en.wikipedia.org/wiki/Hash_function
hash function]]
[section:intro Introduction]
For accessing data based on key lookup, the C++ standard library offers
`std::set`, `std::map`, `std::multiset` and `std::multimap`. These are
generally implemented using balanced binary trees, so that lookup time has
logarithmic complexity. That is generally okay, but in many cases a
__hash-table__ can perform better, as accessing data has constant complexity
on average. The worst case complexity is linear, but that occurs rarely and,
with some care, can be avoided.
Also, the existing containers require a 'less than' comparison object
to order their elements. For some data types this is impossible to implement
or isn't practical. In contrast, a hash table only needs an equality function
and a hash function for the key.
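For example, a simple type can be made usable as a key by supplying those two
ingredients - an equality operator and, for Boost's default hash function
`boost::hash`, an overload of `hash_value` (a sketch; the type `point` is made
up for illustration):

    #include <boost/unordered_set.hpp>
    #include <boost/functional/hash.hpp>

    struct point
    {
        int x, y;
    };

    bool operator==(point const& a, point const& b)
    {
        return a.x == b.x && a.y == b.y;
    }

    std::size_t hash_value(point const& p)
    {
        std::size_t seed = 0;
        boost::hash_combine(seed, p.x);
        boost::hash_combine(seed, p.y);
        return seed;
    }

    // point can now be used as the key of an unordered container:
    boost::unordered_set<point> points;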
So the __tr1__ introduced the unordered associative containers, which are
implemented using hash tables, and they have now been added to the __draft__.
This library supplies a standards-compliant implementation, which is proposed
for addition to Boost. If accepted, these containers should also be added to
__boost-tr1__.
`unordered_set` and `unordered_multiset` are defined in the header
<[headerref boost/unordered_set.hpp]>
    namespace boost {
        template <
            class Key,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_set unordered_set]``;

        template <
            class Key,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_multiset unordered_multiset]``;
    }
`unordered_map` and `unordered_multimap` are defined in the header
<[headerref boost/unordered_map.hpp]>
    namespace boost {
        template <
            class Key, class T,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_map unordered_map]``;

        template <
            class Key, class T,
            class Hash = boost::hash<Key>,
            class Pred = std::equal_to<Key>,
            class Alloc = std::allocator<Key> >
        class ``[classref boost::unordered_multimap unordered_multimap]``;
    }
If using Boost.TR1, these classes are available from the headers
`<unordered_set>` and `<unordered_map>`, in the `std::tr1` namespace.
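The `Hash` and `Pred` parameters can be replaced to change how keys are hashed
and compared. For example, a case-insensitive map of strings might look
something like this (a sketch; `ihash` and `iequal` are made-up names, and the
two must agree - keys that compare equal must produce equal hash values):

    #include <boost/unordered_map.hpp>
    #include <boost/functional/hash.hpp>
    #include <cctype>
    #include <string>

    struct iequal
    {
        bool operator()(std::string const& a, std::string const& b) const
        {
            if (a.size() != b.size()) return false;
            for (std::size_t i = 0; i < a.size(); ++i)
                if (std::tolower((unsigned char) a[i]) !=
                    std::tolower((unsigned char) b[i])) return false;
            return true;
        }
    };

    struct ihash
    {
        std::size_t operator()(std::string const& s) const
        {
            std::size_t seed = 0;
            for (std::size_t i = 0; i < s.size(); ++i)
                boost::hash_combine(seed,
                    (char) std::tolower((unsigned char) s[i]));
            return seed;
        }
    };

    boost::unordered_map<std::string, int, ihash, iequal> dictionary;

Both function objects ignore case, so `dictionary["Boost"]` and
`dictionary["boost"]` refer to the same element.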
The containers are used in a similar manner to the normal associative
containers:
    #include <``[headerref boost/unordered_map.hpp]``>
    #include <cassert>
    #include <string>

    int main()
    {
        boost::unordered_map<std::string, int> x;

        x["one"] = 1;
        x["two"] = 2;
        x["three"] = 3;

        assert(x["one"] == 1);
        assert(x["missing"] == 0);
    }
But since the elements aren't ordered, the output of:

    typedef boost::unordered_map<std::string, int> map;

    // requires <boost/foreach.hpp> and <iostream>
    BOOST_FOREACH(map::value_type i, x) {
        std::cout << i.first << "," << i.second << "\n";
    }
can be in any order. For example, it might be:
    two,2
    one,1
    three,3
    missing,0
There are other differences, which will be detailed later.
[endsect]