diff --git a/doc/allocplus/adaptive_pool_linux-gcc-4.1_1.jpg b/doc/allocplus/adaptive_pool_linux-gcc-4.1_1.jpg new file mode 100644 index 0000000..12d33dd Binary files /dev/null and b/doc/allocplus/adaptive_pool_linux-gcc-4.1_1.jpg differ diff --git a/doc/allocplus/adaptive_pool_linux-gcc-4.1_2.jpg b/doc/allocplus/adaptive_pool_linux-gcc-4.1_2.jpg new file mode 100644 index 0000000..550ca06 Binary files /dev/null and b/doc/allocplus/adaptive_pool_linux-gcc-4.1_2.jpg differ diff --git a/doc/allocplus/adaptive_pool_msvc-7.1_1.jpg b/doc/allocplus/adaptive_pool_msvc-7.1_1.jpg new file mode 100644 index 0000000..2f3c8d8 Binary files /dev/null and b/doc/allocplus/adaptive_pool_msvc-7.1_1.jpg differ diff --git a/doc/allocplus/adaptive_pool_msvc-7.1_2.jpg b/doc/allocplus/adaptive_pool_msvc-7.1_2.jpg new file mode 100644 index 0000000..407db0a Binary files /dev/null and b/doc/allocplus/adaptive_pool_msvc-7.1_2.jpg differ diff --git a/doc/allocplus/adaptive_pool_msvc-7.1_3.jpg b/doc/allocplus/adaptive_pool_msvc-7.1_3.jpg new file mode 100644 index 0000000..932218b Binary files /dev/null and b/doc/allocplus/adaptive_pool_msvc-7.1_3.jpg differ diff --git a/doc/allocplus/adaptive_pool_msvc-7.1_4.jpg b/doc/allocplus/adaptive_pool_msvc-7.1_4.jpg new file mode 100644 index 0000000..3e6dfde Binary files /dev/null and b/doc/allocplus/adaptive_pool_msvc-7.1_4.jpg differ diff --git a/doc/allocplus/adaptive_pool_msvc-7.1_5.jpg b/doc/allocplus/adaptive_pool_msvc-7.1_5.jpg new file mode 100644 index 0000000..18c7091 Binary files /dev/null and b/doc/allocplus/adaptive_pool_msvc-7.1_5.jpg differ diff --git a/doc/allocplus/adaptive_pool_msvc-7.1_6.jpg b/doc/allocplus/adaptive_pool_msvc-7.1_6.jpg new file mode 100644 index 0000000..5f60fe3 Binary files /dev/null and b/doc/allocplus/adaptive_pool_msvc-7.1_6.jpg differ diff --git a/doc/allocplus/burst_allocation_linux-gcc-4.1_1.jpg b/doc/allocplus/burst_allocation_linux-gcc-4.1_1.jpg new file mode 100644 index 0000000..55a70a3 Binary files /dev/null and b/doc/allocplus/burst_allocation_linux-gcc-4.1_1.jpg differ diff --git a/doc/allocplus/burst_allocation_linux-gcc-4.1_2.jpg b/doc/allocplus/burst_allocation_linux-gcc-4.1_2.jpg new file mode 100644 index 0000000..7fdf169 Binary files /dev/null and b/doc/allocplus/burst_allocation_linux-gcc-4.1_2.jpg differ diff --git a/doc/allocplus/burst_allocation_msvc-7.1_1.jpg b/doc/allocplus/burst_allocation_msvc-7.1_1.jpg new file mode 100644 index 0000000..22f647f Binary files /dev/null and b/doc/allocplus/burst_allocation_msvc-7.1_1.jpg differ diff --git a/doc/allocplus/burst_allocation_msvc-7.1_2.jpg b/doc/allocplus/burst_allocation_msvc-7.1_2.jpg new file mode 100644 index 0000000..a1835fe Binary files /dev/null and b/doc/allocplus/burst_allocation_msvc-7.1_2.jpg differ diff --git a/doc/allocplus/bwd_expansion_linux-gcc-4.1.jpg b/doc/allocplus/bwd_expansion_linux-gcc-4.1.jpg new file mode 100644 index 0000000..2a90402 Binary files /dev/null and b/doc/allocplus/bwd_expansion_linux-gcc-4.1.jpg differ diff --git a/doc/allocplus/bwd_expansion_msvc-7.1.jpg b/doc/allocplus/bwd_expansion_msvc-7.1.jpg new file mode 100644 index 0000000..c44ec10 Binary files /dev/null and b/doc/allocplus/bwd_expansion_msvc-7.1.jpg differ diff --git a/doc/allocplus/fwd_expansion_linux-gcc-4.1.jpg b/doc/allocplus/fwd_expansion_linux-gcc-4.1.jpg new file mode 100644 index 0000000..7759e6f Binary files /dev/null and b/doc/allocplus/fwd_expansion_linux-gcc-4.1.jpg differ diff --git a/doc/allocplus/fwd_expansion_msvc-7.1_1.jpg 
b/doc/allocplus/fwd_expansion_msvc-7.1_1.jpg new file mode 100644 index 0000000..2808ba1 Binary files /dev/null and b/doc/allocplus/fwd_expansion_msvc-7.1_1.jpg differ diff --git a/doc/allocplus/fwd_expansion_msvc-7.1_2.jpg b/doc/allocplus/fwd_expansion_msvc-7.1_2.jpg new file mode 100644 index 0000000..782e5e7 Binary files /dev/null and b/doc/allocplus/fwd_expansion_msvc-7.1_2.jpg differ diff --git a/doc/allocplus/fwd_expansion_msvc-7.1_3.jpg b/doc/allocplus/fwd_expansion_msvc-7.1_3.jpg new file mode 100644 index 0000000..98aae03 Binary files /dev/null and b/doc/allocplus/fwd_expansion_msvc-7.1_3.jpg differ diff --git a/doc/allocplus/fwd_expansion_msvc-7.1_4.jpg b/doc/allocplus/fwd_expansion_msvc-7.1_4.jpg new file mode 100644 index 0000000..54671b4 Binary files /dev/null and b/doc/allocplus/fwd_expansion_msvc-7.1_4.jpg differ diff --git a/doc/allocplus/index.html b/doc/allocplus/index.html new file mode 100644 index 0000000..db0eb08 --- /dev/null +++ b/doc/allocplus/index.html @@ -0,0 +1,1505 @@ + + + + + +
+
+Ion Gaztañaga <igaztanaga@gmail.com>
+
I'm sure many C++ programmers have wondered where good old realloc fits in C++. And that's a good question. Could we improve std::vector performance by using memory expansion mechanisms to avoid so many copies? But std::vector is not the only container that could benefit from classic memory allocation strategies: what if we could take advantage of the insertion of multiple elements in std::list using a burst allocation mechanism that amortizes costs (mutex locks, searches...) that can't be amortized when using single node allocation strategies? Some space-saving strategies also come to mind for node containers: can we find a mechanism to build a memory allocator that offers the size-saving benefits of a simple segregated storage strategy (see Boost.Pool) without suffering from overgrown node pools when most of the nodes have been deallocated?
+
This article will explain how some classic improvement strategies have been applied to C++ allocators. Many of these mechanisms are being used for shared-memory allocators and containers (see Boost.Interprocess), where size restrictions are tighter than with heap allocators and other strategies (e.g. per-thread pools) are not applicable. So we'll be using Boost.Interprocess containers to measure the speedup that we can obtain when using those classic strategies in C++.
Obviously, these C++ allocators need some underlying general purpose allocator, since new and delete don't offer expansion and burst capabilities. In this article we'll use a modified Doug Lea Malloc (DLmalloc) memory allocator with new basic functions that offer the necessary underlying machinery to implement all these classic strategies. DLmalloc is known to be very size and speed efficient, and it is used as the basis of many malloc implementations, including multithreaded allocators built on top of DLmalloc (see ptmalloc2, ptmalloc3 or nedmalloc). So the improvements obtained by the classic strategies applied to C++ will measure real-world speedups, even when the underlying allocator is very efficient.
I've developed some additional classes (STL allocators, pools and tests), I've put all this code under the Boost Software License, and I've followed Boost guidelines so that the developed code can be considered for integration into Boost.
+ +vector
There have been a couple of C++ proposals to add in-place expansion mechanisms to standard allocators (N1953: Upgrading the Interface of Allocators using API Versioning, written by Howard Hinnant, and N2045: Improving STL allocators, written by me based on Hinnant's proposal). Unfortunately they didn't receive much interest in the committee, but there is hope that we could get some of this functionality for C++1x if the C++ community shows some interest. Let's review a bit what these proposals said:
Hinnant's proposal adds some new functionalities to C++ allocators:

- The ability to ask the allocator for the real size of an allocated buffer.
- The ability to expand an already allocated buffer forward, in place.
- The ability to shrink an already allocated buffer in place.
- The ability to allocate specifying a limit size and a preferred size, receiving the real size of the allocated buffer.
The first functionality is interesting to avoid wasting some precious bytes that the memory allocator adds at the end of the buffer due to alignment or to its own memory allocation algorithm. As an example, DLmalloc on 32 bit systems needs a minimum of 16 bytes for every allocation (4 bytes are payload and 12 bytes are available for the user): if we initialize a std::string with a 4 character narrow string ("abcd") we will waste 8 bytes, because we have no way to know that there is room in our newly allocated buffer that could be used for longer strings without going through the allocator again. Other allocators waste even more memory. On 64 bit systems the minimum size (and thus, the waste) is bigger: 32 bytes (24 bytes available for the user) for DLmalloc.
+
The second functionality is the classical use case of realloc in C: instead of allocating a new buffer, the programmer tries to expand the current one. If the expansion succeeds, there is no need to copy data to the new buffer (in C++, that means avoiding potentially costly copy constructors). The third one is exactly the inverse: a vector that has grown considerably now holds only a few elements, vector.capacity() is much bigger than vector.size(), and we'd like to return memory to the system. Many programmers use the shrink-to-fit trick to return memory to the system, but this trick needs a possibly throwing copy, whereas a shrink to fit operation implemented by the allocator would not need any copying at all.
+
The last functionality is really interesting because a vector usually increases its capacity using a growth factor (typically 1.5 or 2.0) to avoid too many reallocations. That is, when we insert new objects in a vector and vector.capacity() == vector.size(), the vector allocates much more memory than the size needed to insert a new object. When allocating new memory to insert a new object, the limit size is size() + 1, the preferred size is size()*growth_factor and the received size is the real size of the buffer. When inserting a new element in a vector, it is usually better to expand the memory to hold the new element than to allocate a new, bigger buffer.
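To make these three sizes concrete, here is a small illustrative sketch (my own example, not part of the proposals' wording) of the values a vector would compute when push_back finds the buffer full:

   #include <cstddef>

   //Illustrative only: the sizes a vector would request from an
   //expansion-capable allocator when push_back finds size() == capacity().
   void growth_sizes(std::size_t size,
                     std::size_t &limit_size,
                     std::size_t &preferred_size)
   {
      limit_size     = size + 1;         //the minimum needed to insert one more element
      preferred_size = size + size / 2;  //what we'd like to get (~1.5x growth factor)
      //received_size is an output of the allocation call itself: the allocator
      //fills it with the real size of the buffer it managed to provide.
   }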
+
N2045 makes some additions:

- Backwards expansion: the ability to expand an existing buffer towards lower addresses, reusing free memory placed just before it.
- The ability to combine expansion and new allocation in a single call.
- The separation of node and array allocation, just as C++ separates operator new and operator new[].
- Burst allocation: the ability to allocate many buffers in a single call.
- Burst deallocation: the ability to deallocate many buffers in a single call.
The first addition is useful for embedded systems and other size-limited environments, like fixed-size shared memory. If a vector can only be expanded forward, the free memory preceding its buffer can only be reused through a new allocation, and only if that preceding memory is big enough to hold the whole new buffer (that is, 1.5 or 2.0 times bigger than the current one). With backwards expansion this limitation disappears, and backwards expansion can be combined with forward expansion to optimize memory usage.
The vector has to copy elements backwards, so backwards expansion needs exactly the same number of copies as a new allocation. Many times backwards expansion can be implemented very efficiently in memory algorithms that have fast access to the previous and next memory buffers. Backwards expansion also has some downsides:

- An object type with sizeof(Type) == 12 can't be expanded backwards by 16 bytes, because the distance between the old beginning and the new beginning is not a multiple of 12, and thus the old objects would be partially overwritten and could not be safely move-assigned.

The second addition tries to pack into a single call several attempts to expand or allocate memory, because usually a vector will first try to expand memory and, if this fails, it will try to allocate a new buffer. Since checking for expansion capabilities is quite fast compared to mutex locking and other checks performed by the allocator, both functions can be combined in a single call.
+
The third addition (separating node and array allocation) is based on the fact that, with the addition of expansion functions, node allocators based on classic segregated storage pools can't offer expansion capabilities. This might not sound like a big problem for pure node containers like std::list, but some hybrid containers like std::unordered_map wouldn't be able to use both segregated storage (for the nodes) and expansion features (for the bucket array).
+
+Furthermore, some allocators (see +N2554 The Scoped Allocator Model (Rev 2)) can be propagated through containers, so +having an allocator that can use in-place expansion capabilities and node-pooling +capabilities is very interesting. And this can only be achieved if the allocator uses +different functions for single element and array operations. +
++The last two +additions are just optimizations to reduce locking and to minimize the number of allocation +calls. +
Hinnant's proposal was implemented in the Freescale (formerly Metrowerks) standard library (MSL) with great success. My approach was implemented in Boost.Interprocess. This article is an evolution of my last proposal, applying these techniques to a widely used heap allocator.
+ +allocation_command
. This is the
+explanation of the function:
   enum allocation_type
   {
      //Bitwise OR (|) combinable values
      allocate_new        = ...,
      expand_fwd          = ...,
      expand_bwd          = ...,
      shrink_in_place     = ...,
      try_shrink_in_place = ...,
      nothrow_allocation  = ...
   };

   template<class T>
   std::pair<T *, bool> allocation_command
      ( allocation_type command, std::size_t limit_size
      , std::size_t preferred_size, std::size_t &received_size, T *reuse_ptr = 0);
+Preconditions for this function: +
++
command
must contain at least one of these values: shrink_in_place
,
+try_shrink_in_place
, expand_fwd
, expand_bwd
or allocate_new
.
+command
contains the value shrink_in_place
it can't
+contain any of these values: expand_fwd
, expand_bwd
,
+allocate_new
, try_shrink_in_place
.
+command
contains the value try_shrink_in_place
it can't
+contain any of these values: expand_fwd
, expand_bwd
,
+allocate_new
, shrink_in_place
.
+command
contains expand_fwd
or expand_bwd
, the
+parameter reuse_ptr
+must be non-null and returned by a previous allocation function.
+command
contains the value shrink_in_place
,
+the parameter limit_size
+must be equal or greater than the parameter preferred_size
.
+command
contains any of these values: expand_fwd
or
+expand_bwd
, the parameter limit_size
must be equal or less than the
+parameter preferred_size
.

These are the effects of this function:
++
command
contains the value shrink_in_place
,
+(with optional additional nothrow_allocation
) the function will
+try to reduce the size of the memory block referenced by pointer reuse_ptr
to the
+value preferred_size
moving only the end of the block. If it's not possible, it
+will try to reduce the size of the memory block as much as possible as long as this results in
+size(p) <= limit_size
. Success is reported only if this results in preferred_size
+<= size(p)
and size(p) <= limit_size
.
+command
contains the value try_shrink_in_place
,
+(with optional additional nothrow_allocation
) the function will act as if
+a shrink_in_place
was demaned but it won't shrink the buffer. This function
+is useful to know if a shrink operation will have success with the given parameters and
+obtain in the parameter received_size
a value that is guaranteed to succeed
+as limit_size
on a subsequent shrink in place operation. This function is
+useful to know exactly how many objects a caller should destroy with the given
+limit_size
and preferred_size
.
+command
only contains the value expand_fwd
(with
+optional additional nothrow_allocation
), the allocator will try to increase the
+size of the memory block referenced by pointer reuse moving only the end of the block to the
+value preferred_size
. If it's not possible, it will try to increase the size of
+the memory block as much as possible as long as this results in size(p) >= limit_size
.
+Success is reported only if this results in limit_size <= size(p)
.
+command
only contains the value expand_bwd
(with
+optional additional nothrow_allocation
), the allocator will try to increase the
+size of the memory block referenced by pointer reuse_ptr
only moving the start of
+the block to a returned new position new_ptr
. If it's not possible, it will try
+to move the start of the block as much as possible as long as this results in size(new_ptr)
+>= limit_size
. Success is reported only if this results in limit_size <=
+size(new_ptr)
.
+command
only contains the value allocate_new
(with
+optional additional nothrow_allocation
), the allocator will try to allocate
+memory for preferred_size
objects. If it's not possible it will try to allocate
+memory for at least limit_size
+objects.
+command
only contains a combination of expand_fwd
and
+allocate_new
, (with optional additional nothrow_allocation
) the
+allocator will try first the forward expansion. If this fails, it would try a new
+allocation.
+command
only contains a combination of expand_bwd
and
+allocate_new
(with optional additional nothrow_allocation
), the
+allocator will try first to obtain preferred_size
objects using both methods if
+necessary. If this fails, it will try to obtain limit_size
+objects using both methods if necessary.
+command
only contains a combination of expand_fwd
and
+expand_bwd
(with optional additional nothrow_allocation
), the
+allocator will try first forward expansion. If this fails it will try to obtain preferred_size
+objects using backwards expansion or a combination of forward and backwards expansion. If this
+fails, it will try to obtain limit_size
+objects using both methods if necessary.
+command
only contains a combination of allocation_new, expand_fwd
+and expand_bwd
, (with optional additional nothrow_allocation
) the
+allocator will try first forward expansion. If this fails it will try to obtain preferred_size
+objects using new allocation, backwards expansion or a combination of forward and backwards
+expansion. If this fails, it will try to obtain limit_size
+objects using the same methods.
+received_size
.
+On failure the allocator writes in received_size
a possibly successful limit_size
+parameter for a new call.+Throws an exception if two conditions are met: +
nothrow_allocation
.+This function returns: +
nothrow_allocation
+the first member will be 0 if the allocation/expansion fails or there is an error in
+preconditions.
++Notes: +
- If the user chooses char as template argument, the returned buffer will be suitably aligned to hold any type.
- If the user chooses char as template argument and a backwards expansion is performed, although properly aligned, the returned buffer might not be suitable, because the distance between the new beginning and the old beginning might not be a multiple of the size of the type the user wants to construct, since due to internal restrictions the expansion can be slightly bigger than the requested bytes. When performing backwards expansion, if you have already constructed objects in the old buffer, it's important to correctly specify the type.

Obviously, we need an underlying C function that offers these possibilities. I've added the following function to the modified dlmalloc library:
+ +typedef struct boost_cont_command_ret_impl +{ + void *first; + int second; +}boost_cont_command_ret_t; + +boost_cont_command_ret_t boost_cont_allocation_command + ( allocation_type command , size_t sizeof_object + , size_t limit_objects , size_t preferred_objects + , size_t *received_objects , void *reuse_ptr ); ++ +
This function does the same job as allocation_command, so it takes the same options plus an additional parameter indicating the size of the object (especially important for backwards expansion). This function does not throw exceptions, because C has no exceptions, but other than these two differences its use is exactly the same as allocation_command.
+
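To see how the pieces fit together, here is a minimal sketch (not the actual container code; it assumes the allocation_type values are visible and that the allocator behaves exactly as specified above) of how a vector-like class could grow its buffer, trying forward expansion and a new allocation in a single call:

   #include <utility>
   #include <cstddef>

   //Minimal sketch: grow a buffer 'p' holding 'size' constructed objects using
   //a Version 2 allocator 'a'; only move elements when a new buffer is returned.
   template<class Allocator, class T>
   void grow_buffer(Allocator &a, T *&p, std::size_t &capacity, std::size_t size)
   {
      std::size_t received_size = 0;
      std::pair<T*, bool> ret = a.allocation_command
         ( expand_fwd | allocate_new      //first try to expand in place, else allocate
         , size + 1                       //limit_size: we need room for one more object
         , size + size / 2                //preferred_size: ~1.5x growth
         , received_size, p);
      if(ret.second){                     //expanded in place: nothing to move
         capacity = received_size;
      }
      else{                               //new buffer returned by the same call
         //...move or copy [p, p + size) into ret.first, destroy and deallocate p...
         p        = ret.first;
         capacity = received_size;
      }
   }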
+
To compare apples to apples, we can't just compare the standard allocator against an allocator based on DLmalloc with expansion capabilities. For this purpose we've developed an allocator (called, surprisingly, ba::allocator). Here's the declaration:
+
   namespace boost {
   namespace container {

   template<class T, unsigned Version = 2, unsigned int AllocationDisableMask = 0>
   class allocator;

   }  //namespace container
   }  //namespace boost
The first template parameter is the type of the objects to be allocated. The second is the version of the allocator (see N1953 for details). If Version is 1, then the allocator is a standard conforming classic allocator around dlmalloc and dlfree with no additional features. If Version is 2, then the allocator has expansion capabilities and other features, like specifying limit and preferred sizes, obtaining the real size of an allocated buffer, separate node and array allocations, etc.
+
+The third parameter is a mask that can be used to disable some capabilities. For
+example, specifying expand_bwd | expand_fwd
will disable forward
+and backwards expansion and the allocator will only provide new buffer allocations.
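For reference, these are the four configurations compared in the push_back benchmarks below (a sketch: the StdAllocator/AllocatorPlus* names are simply the labels used in the graphs, and ba is assumed to be an alias for the allocator's namespace):

   typedef std::allocator<int>                              StdAllocator;        //compiler's standard allocator
   typedef ba::allocator<int, 1>                            AllocatorPlusV1;     //plain dlmalloc/dlfree wrapper
   typedef ba::allocator<int, 2, expand_fwd | expand_bwd>   AllocatorPlusV2Mask; //V2 with expansion disabled
   typedef ba::allocator<int, 2>                            AllocatorPlusV2;     //V2, fully featured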
+
vector
Let's first present the class that will be used in our performance tests. It's a simple wrapper over an int, with a copy constructor that does a job equivalent to most move constructors in C++0x. Of course, performance gains will be bigger for heavy objects that are copyable but not movable, but I think this class is fair enough.
+
+ +class MyInt +{ + int int_; + + public: + MyInt(int i = 0) : int_(i){} + + MyInt(const MyInt &other) + : int_(other.int_) + {} + + MyInt & operator=(const MyInt &other) + { + int_ = other.int_; + return *this; + } +}; + ++
+Now we'll push back an element one by one in a vector, and measure the time using +Boost.DateTime: +
++ unsigned int numalloc = 0, numexpand = 0, capacity = 0; + + ptime tini = microsec_clock::universal_time(); + + for(unsigned int r = 0; r != num_iterations; ++r){ + bi::vector<MyInt, IntAllocator> v; + v.reset_alloc_stats();+//Reset allocation statistics + for(unsigned int e = 0; e != num_elements; ++e) + v.push_back(e); + numalloc += v.num_alloc; numexpand += v.num_expand_fwd; + capacity = static_cast<unsigned int>(v.capacity()); + } + + ptime tend = microsec_clock::universal_time(); +//... +
+We'll test 4 different allocators: +
std::allocator<int>
, the standard
+ allocator that comes with the C++ compilerba::allocator<int, 1>
, a STL
+ conformant allocator around dlmalloc
and dlfree
.ba::allocator<int, 2, expand_bwd |
+ expand_fwd>
, an allocator with capabilities to specify a limit size the
+ preferred size and real size of a buffer, but with expansion capabilities disabled.ba::allocator<int, 2>
, a fully
+ equipped improved allocator. (Note: There is no need to disable backwards expansions because
+ forward expansion has always preference over new allocations and backwards expansion).
+and we'll use several [num_iterations, num_elements]
combinations: [10000,
+10000], [100000, 1000], [1000000, 100] and [10000000, 10]
+
For each allocator we'll show the following information: the push_back speed (normalized to AllocatorPlusV1), the number of calls to allocation functions, the number of successful in-place expansions and the final vector capacity.
As we see in the picture, DLmalloc is faster than the standard allocator provided by MSVC 7.1 (StdAllocator vs. AllocatorPlus). We also see that V2 allocators call allocator functions fewer times (for size=10, 4 times vs. 6 times), because, using the received_size parameter of allocation_command, the container can take advantage of the real size of the newly allocated small buffer.
+
+For big sizes (size=10000), due to the expansion capabilities of AllocatorPlusV2, +this is twice as fast as AllocatorPlusV2Mask and 2.5 times faster than the standard +conforming, DLmalloc based AllocatorPlusV1. As we can see in the third graph, only +AllocatorPlusV2 performs expand in place, whereas the rest only allocate new buffers. +
+ +
+Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1 compiler
+and will check push_back
times:
+
We see that the standard allocator is faster than our V1 (a standard conforming DLmalloc wrapper) by a small margin for some sizes and slightly slower for bigger sizes. That's because glibc malloc is based on ptmalloc2, which is in turn based on DLmalloc, but uses per-thread pools, minimizing mutex locks.
+ ++Like in Windows, expand-in-place capable allocator is twice as fast as the V1 allocator, +confirming that forward expansion offers a considerable speed up. This speed up will be +noticeable in initialization phases when vectors are filled with many back insertions. The +speed-up will be more noticeable if vector elements are elements with no move constructors +(e.g. third-party classes than can't be updated with move semantics). +
+ +vector
+Historically backwards expansion has not been very used for allocation strategies, basically +because it does not offer the speed improvements offered by forward expansion for +usual vector implementations. Similarly to new allocation, we need to move old objects +(an implementation might avoid data-moving if it manages uninitialized data in the beginning of +the memory buffer, but that's not usual). Backwards expansion has the ability to reduce the peak +memory used by the vector (while allocating new memory, the vector has to maintain the old +buffer until copies data to the new one) effectively reusing free memory placed just before the +buffer to be expanded. +
+ ++As we've said before, the backwards expansion algorithm has to calculate the LCM (least common multiple) +of the memory allocation alignment (8 bytes in DLmalloc) and the size of the object. +For very efficient algorithms like DLmalloc, I've found that calculating the LCM using the +traditional Euclidean algorithm was a big +overhead that made backwards expansion considerably slower than new +allocation. It's necessary to take advantage of the power of 2 memory alignment and implement +optimizations for object sizes also power of 2 or multiple of 2, 4, or 8 bytes, and use +optimized versions to calculate the LCM. Fortunately, +these optimizations made backwards expansion slightly faster than a new allocation, making +backwards expansion better than new allocation in every aspect. +
+ +
+To test backwards expansion we'll execute the same push_back
test
+using the following 4 allocators:
+
+
std::allocator<int>
, the standard
+allocator that comes with the C++ compiler
+ba::allocator<int, 1>
, a STL
+conformant allocator around dlmalloc
and dlfree
.
+ba::allocator<int, 2, expand_bwd |
+expand_fwd>
, an allocator with capabilities to specify a limit size the
+preferred size and real size of a buffer, but with expansion capabilities disabled.
+ba::allocator<int, 2, expand_fwd>
,
+a fully equipped improved allocator, that avoids forward expansion (forward expansion
+has priority over backwards expansion, so we need to disable it to only measure backwards
+expansion).+And now in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1 compiler: +
+ +
+We see that for most cases backwards expansion (AllocatorPlusV2) is slightly faster than new
+allocation (AllocatorPlusV1) and new allocation with received_size
capabilities
+(AllocatorPlusV2Mask).
+
vector
+The
+Shrink-to-fit is a widely used idiom that can be improved with the shrink_in_place"
+option provided by allocation_command
method. To test this function we'll use the following test:
+
   const std::size_t Step = 5;

   ptime tini = microsec_clock::universal_time();

   typedef boost::interprocess::detail::integral_constant
      <unsigned, boost::interprocess::detail::version<Allocator>::value> alloc_version;

   for(unsigned int r = 0; r != num_iterations; ++r){
      //Create a vector with num_elements size
      bi::vector<MyInt, IntAllocator> v(num_elements);
      v.reset_alloc_stats();
      //Empty the vector erasing the last Step elements and calling shrink_to_fit()
      for(unsigned int e = num_elements; e != 0; e -= Step){
         v.erase(v.end() - Step, v.end());
         v.shrink_to_fit();
      }
   }

   ptime tend = microsec_clock::universal_time();
+That is, we'll initialize the vector with an initial size and will erase the last Step elements (5 in this case),
+and shrink to fit the capacity of the vector. We'll repeat this operation for different
+[num_iterations, num_elements]
combinations: [100,
+10000], [1000, 200] and [10000, 500]
+
+To test shrink to fit we'll execute the test using the following 3 allocators: + +
std::allocator<int>
, the standard
+allocator that comes with the C++ compiler
+ba::allocator<int, 1>
, a STL
+conformant allocator around dlmalloc
and dlfree
.
+ba::allocator<int, 2, expand_fwd>
,
+a fully equipped improved allocator with shrink to fit capacity.
+As the initial vector size grows, AllocatorPlusV2
shines (up to 90 times faster!)
+because a shrink to fit operation is a no-throw
+operation that can have amortized constant-time complexity due to the fact that an allocator usually implements
+shrink to fit with a deallocation: the allocator divides the old buffer in two independent buffers and
+deallocates the second buffer. On the other hand, vectors with allocators with no shrink to fit functionality
+need to allocate a new buffer, move elements to the new one (moving elements is a linear operation
+with the size of the vector) and deallocate the old buffer.
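As an illustration of why this is so cheap, here is a minimal sketch (assuming the allocation_command semantics described earlier, not the actual Boost.Interprocess code) of a copy-free shrink:

   #include <utility>
   #include <cstddef>

   //Minimal sketch: try to give the unused tail of the buffer back to the
   //allocator; on success no element is copied or moved.
   template<class Allocator, class T>
   void shrink_buffer(Allocator &a, T *p, std::size_t size, std::size_t &capacity)
   {
      std::size_t received_size = 0;
      std::pair<T*, bool> ret = a.allocation_command
         ( shrink_in_place | nothrow_allocation
         , capacity                  //limit_size: never end up bigger than now
         , size                      //preferred_size: ideally shrink down to size()
         , received_size, p);
      if(ret.first)                  //success: same buffer, smaller usable size
         capacity = received_size;
   }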
+
Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using the gcc-4.1 compiler and check the shrink-to-fit times:
+
Results are similar to the Windows results, with AllocatorPlusV2 shining more and more as the initial vector size grows.
+
Many times, operations can be performed faster if they are packed in groups, because some costs are not related to the number of operations but to the number of groups. An example of this is the templated range insertion operation (insert(pos, first, last)) in vectors. If Iterator is a forward iterator, most implementations calculate the distance between begin and end so that they know the size of the buffer needed to hold all those elements. With this information, they only perform one allocation to insert the whole range.
Other containers can't take advantage of this possibility because they are node containers, meaning that each element is obtained with a call to the allocator that allocates a single element. Since each element has its own independent memory, these containers offer stronger iterator validity guarantees, and pointers and references to inserted objects are not easily invalidated. On the other hand, inserting N elements in those containers requires N calls to the allocator.
To solve this, the idea is to have a function that can allocate a lot of objects in a single call, guaranteeing that each object is individually deallocatable. Some memory management algorithms offer this possibility, even allowing a different buffer size for each allocation in the group. For example, DLmalloc offers the following two functions:
+void** independent_calloc(size_t n_elements, size_t element_size, void*chunks[]); ++ +
+Explanation: independent_calloc
is similar to calloc, but instead of returning a
+single cleared space, it returns an array of pointers to n_elements
+independent elements that can hold contents of size elem_size, each
+of which starts out cleared, and can be independently freed,
+realloc'ed etc. The elements are guaranteed to be adjacently
+allocated (this is not guaranteed to occur with multiple callocs or
+mallocs), which may also improve cache locality in some
+applications. [...]
+
+void** independent_comalloc(size_t n_elements, size_t sizes[], void* chunks[]); ++ +
+Explanation: independent_comalloc
allocates, all at once, a set of n_elements
+chunks with sizes indicated in the "sizes" array. It returns an array of pointers to these elements,
+each of which can be independently freed, realloc'ed etc. The elements are guaranteed to be
+adjacently allocated (this is not guaranteed to occur with multiple callocs or mallocs),
+which may also improve cache locality in some applications. [...]
+
+For these two functions each element must be individually freed when it is no longer +needed and objects are guaranteed to be contiguously allocated (to help cache locality). +Also, the "chunks" argument is optional (i.e., may be null, which is +probably the most typical usage). If it is null, the returned array +is itself dynamically allocated and should also be freed when it is +no longer needed. Otherwise, the chunks array must be of at least +n_elements in length. It is filled in with the pointers to the +chunks. +
+ +
+According to DLmalloc documentation independent_calloc
+simplifies and speeds up implementations of many
+kinds of pools. It may also be useful when constructing large data
+structures that initially have a fixed number of fixed-sized nodes,
+but the number is not known at compile time, and some of the nodes
+may later need to be freed:
+
+ struct Node { int item; struct Node* next; }; + + struct Node* build_list() { + struct Node** pool; + int n = read_number_of_nodes_needed(); + if (n <= 0) return 0; + pool = (struct Node**)(independent_calloc(n, sizeof(struct Node), 0); + if (pool == 0) die(); ++ + + +// organize into a linked list... + struct Node* first = pool[0]; + for (i = 0; i < n-1; ++i) + pool[i]->next = pool[i+1]; + free(pool);// Can now free the array (or not, if it is needed later) + return first; + } +
+According to DLmalloc documentation independent_comalloc
+can be used to speed up allocation in cases where several structs or
+objects must always be allocated at the same time:
+
+
+ struct Head { ... } + struct Foot { ... } + + void send_message(char* msg) { + int msglen = strlen(msg); + size_t sizes[3] = { sizeof(struct Head), msglen, sizeof(struct Foot) }; + void* chunks[3]; + if (independent_comalloc(3, sizes, chunks) == 0) + die(); + struct Head* head = (struct Head*)(chunks[0]); + char* body = (char*)(chunks[1]); + struct Foot* foot = (struct Foot*)(chunks[2]); ++ + + +// ... + } +
Unfortunately, these two functions have some downsides when working with some containers: the required array of pointers needs extra memory (a node container like list is forced to allocate an
+array to hold pointers to these nodes) and imposes an unnecessary overhead to node containers.+The first issue (the required external or internally allocated array) can be avoided if every memory node is linked +with an intrusive singly linked list and offering an iterator-like interface. The intrusive list is built in the first bytes +reserved for the user data because the minimum node size guarantees 12 bytes for 32 bits systems and 24 bytes for 64 bit systems. +Obviously, once the user has overwritten that memory, there is no way to iterate to the next node so the programmer must be careful. Let's +present these new functions: + +
++ +These functions offer an emulation of a C++ input iterator. + +/* Iterator functions */ +typedef /**/ multialloc_it_t;/* Iterator type */ +BOOST_CONTAINER_INIT_END_IT(IT)/* Postcondition: BOOST_CONTAINER_IS_END_IT(IT) != 0 */ +BOOST_CONTAINER_IS_END_IT(IT)/* Is an end iterator? */ +BOOST_CONTAINER_NEXT_IT(IT)/* operator ++ emulation */ +BOOST_CONTAINER_ADDR_IT(IT)/* operator* emulation, returns the address of the memory */ +
++ +These functions offer the possibility of building an intrusive chain (list) of nodes previously allocated by DLmalloc +and obtain an iterator that can traverse them. These functions are used to build a range that can be deleted +in a single call using with/* Multiallocation chain functions */ +typedef /**/ multialloc_it_chain_t;/* Chain type */ +BOOST_CONTAINER_IT_CHAIN_INIT(CHAIN)/* Postcondition: BOOST_CONTAINER_IT_CHAIN_IT(IT_CHAIN) is end it */ +BOOST_CONTAINER_IT_CHAIN_PUSH_BACK(CHAIN, MEM)/* Push back the node MEM in the intrusive linked list chain */ +BOOST_CONTAINER_IT_CHAIN_PUSH_FRONT(CHAIN, MEM)/* Push front the node MEM in the intrusive linked list chain */ +BOOST_CONTAINER_IT_CHAIN_SPLICE_BACK(CHAIN, CHAIN2)/* Splice chain2 in the end of the first chain */ +BOOST_CONTAINER_IT_CHAIN_IT(IT_CHAIN)/* Return an input iterator that traverse the chain */ +BOOST_CONTAINER_IT_CHAIN_SIZE(IT_CHAIN)/* Returns the size of the chain */ +
boost_cont_multidealloc()
(See below).
+
+++ +/* Some defines for "contiguous_elements" parameters */ +DL_MULTIALLOC_ALL_CONTIGUOUS/* Allocate all elements contiguous on memory */ +DL_MULTIALLOC_DEFAULT_CONTIGUOUS/* The implementation chooses the maximum contiguous elements ++/* Burst allocation: + Allocates "n_elements nodes" of size "elem_size" with a maximum (if possible) of "contiguous_elements". + On success returns an input iterator. On failure returns an "end" iterator.*/ +multialloc_it_t boost_cont_multialloc_nodes (size_t n_elements, size_t elem_size, size_t contiguous_elements); ++/* Burst allocation: + Allocates "n_arrays" of elements, whose size is "sizeof_element" and whose lengths are specified in the "sizes" + parameter. Contiguous memory won't be (if possible) bigger than "contiguous_elements"*sizeof_element. + On success returns an input iterator. On failure returns an "end" iterator.*/ +multialloc_it_t boost_cont_multialloc_arrays (size_t n_arrays, const size_t *lengths, size_t sizeof_element, size_t contiguous_elements); ++/* Burst deallocation: deallocates all the range specified by the iterator */ +void boost_cont_multidealloc(multialloc_it_t it); +
+boost_cont_multialloc_nodes
is a function that allocates multiple nodes of the same size in the same call returning
+an iterator to the range. is a function that allocates multiple arrays in the same call returning
+an iterator to the range. The user can specify the (suggested) contiguous number of elements in both functions.
+boost_cont_multidealloc
deallocates a range of arrays or nodes that have been allocated
+by boost_cont_multialloc_nodes
, boost_cont_multialloc_arrays
or allocated by other DLmalloc functions and
+chained using BOOST_CONTAINER_IT_CHAIN_XXX
functions.
+
+Now that we have a low-level swiss knife we can define a C++ interface for burst allocation: + +
+template<class T> +class allocator +{ ++ +Burst allocation also follows the separation between node (size == 1) and array allocation functions +so that node allocation can use segregated storage mechanisms, whereas array allocation can use +another approach that can be used with expansion functions (//... + typedef /**/ multiallocation_iterator;//wrapper over multialloc_it_t + typedef /**/ multiallocation_chain;//wrapper over multialloc_chain_t ++ //------------------------------------------------------------------ + // Functions for array allocation (expandable memory) + //------------------------------------------------------------------ + + //Allocates many elements of size elem_size in a contiguous block + //of memory. Elements must be individually deallocated with deallocate() + multiallocation_iterator allocate_many(size_type elem_size, size_type n_elements); ++ //Allocates n_elements elements, each one of size elem_sizes[i] + //Elements must be individually deallocated with deallocate() + multiallocation_iterator allocate_many(const size_type *elem_sizes, size_type n_elements); ++ //Deallocates the memory arrays pointed by the iterator range starting at it + void deallocate_many(multiallocation_iterator it); ++ //------------------------------------------------------------------ + //Functions for node (size == 1, non-expandable memory) allocation + //------------------------------------------------------------------ + + //Allocates many elements of size == 1. + //Elements must be individually deallocated with deallocate_one() + multiallocation_iterator allocate_individual(size_type num_elements); ++ //Deallocates the memory nodes pointed by the iterator range starting at it + void deallocate_individual(multiallocation_iterator it); +}; +
allocation_command
).
+
+
++We have all that we need to implement burst allocation and deallocation for STL containers. Now, let's measure +if this leads to any speed improvement. +
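As a sketch of how a node container could consume this interface when inserting n equal elements, assuming the multiallocation_iterator behaves like a standard input iterator whose default-constructed value acts as the end of the range (an assumption, since only its role is described above):

   //Hypothetical sketch: build n nodes with one allocator call instead of n calls.
   template<class NodeAllocator, class T>
   void burst_insert(NodeAllocator &a, std::size_t n, const T &value)
   {
      typename NodeAllocator::multiallocation_iterator it = a.allocate_individual(n);
      typename NodeAllocator::multiallocation_iterator end;   //assumed end iterator
      for(; it != end; ++it){
         void *raw = &*it;   //memory for one node (deallocatable with deallocate_one)
         //...construct a node holding 'value' in 'raw' and link it into the container...
      }
   }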
+ ++A good place to take advantage of burst allocation is range insertion. Other operations(range assignment, +copy construction) are also good places to implement burst allocation. So let's try this code with several +allocators to test insertion time: +
+ ++ tini = microsec_clock::universal_time(); + + bi::list<MyInt, IntAllocator> l; + for(unsigned int r = 0; r != num_iterations; ++r){ + l.insert(l.end(), num_elements, MyInt(r)); + } + + tend = microsec_clock::universal_time(); ++ + +
+That is, we'll try insertions of different range sizes (num_elements
) and different
+iterations (num_iterations
). We'll end up with the same number of elements in the
+list, so we can compare results easier. To measure burst deallocation we'll preprocess several
+iterator ranges and measure the time to completely empty the list:
+
++ +//Now preprocess ranges to erase + std::vector <typename bi::list<MyInt, IntAllocator>::iterator> ranges_to_erase; + ranges_to_erase.push_back(l.begin()); + + for(unsigned int r = 0; r != num_iterations; ++r){ + typename bi::list<MyInt, IntAllocator>::iterator next_pos(ranges_to_erase[r]); + std::size_t n = num_elements; + while(n--){ ++next_pos; } + ranges_to_erase.push_back(next_pos); + } + +//Measure range erasure function + tini = microsec_clock::universal_time(); + + for(unsigned int r = 0; r != num_iterations; ++r){ + l.erase(ranges_to_erase[r], ranges_to_erase[r+1]); + } + + tend = microsec_clock::universal_time(); +
As shown, a good place to take advantage of burst deallocation is range erasure, along with clear() and destructors.
+
+We'll test 3 different allocators: +
std::allocator<int>
, the standard
+ allocator that comes with the C++ compilerba::allocator<int, 1>
, a STL
+ conformant allocator around dlmalloc
and dlfree
(no burst allocation).ba::allocator<int, 2>
, burst-allocation capable
+ allocator.[num_iterations, num_elements]
combinations: [200,
+10000], [2000, 1000], [20000, 100] and [200000, 10]
+
+
++For each allocator we'll show the following information: +
+The graph shows that allocation speed per element is constant for AllocatorPlusV1 (surprisingly, not so constant for
+Visual's standard allocator) and this is pretty logical since each element needs an allocation. AllocatorPlusV2 is
+using DL_MULTIALLOC_DEFAULT_CONTIGUOUS
which allocates contiguous nodes up to 4096 bytes. That's why
+AllocatorPlusV2 performance stabilizes for big ranges reaching to a 3x speedup.
+
Burst deallocation improvements are more modest. Why? Because list must construct a node chain (iterating through all the nodes to be erased) before calling deallocate_individual; that is, we must traverse the nodes twice (once to construct the chain and once more inside the internal C function).
+ ++Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1: +
+ ++In Linux, the standard allocator (ptmalloc2, based on DLmalloc) performs very well, and the lock reduction +achieved by its per-thread-pool approach offers a similar performance than burst allocation (AllocatorPlusV2) +for small ranges. With bigger ranges (>=100) AllocatorPlusV2 maintains (more or less) its performance while +StdAllocator can't keep up. Like in WinXP Visual 7.1 tests, AllocatorPlusV2 is faster (up to 2x in some cases) +than AllocatorPlusV1 for big ranges. +
+ +
+Burst deallocation is not faster than the standard allocator in Linux, surely because ptmalloc2 reduces
+mutex locks just like burst deallocation, but burst deallocation needs to traverse the range to be
+erased twice as explained before. When comparing to AllocatorPlusV1, that is, a wrapper around
+dlmalloc
and dlfree
burst deallocation (AllocatorPlusV2) is 30% faster for
+big ranges.
+
When developing allocators to improve the performance of node containers, one of the widely used approaches is simple segregated storage (e.g. Boost.Pool). The idea behind simple segregated storage is to divide a big memory portion ("block"), allocated through a general purpose memory manager, into fixed-size "nodes". A memory manager ("pool") holds a list of "blocks" and a singly linked list of free nodes. Since all nodes are the same size, simple segregated storage can't serve nodes of different sizes. To solve this, an application creates a different pool for each size, and usually different types with the same size share the same pool.
+ ++The free node list is intrusive (since the node is free, the list pointer is built in the node), +so that there is no size overhead for each node. Simple Segregated Storage is very fast: +
+Now the question is: when does the pool return fully free "blocks" (allocated each time the free node list was empty) to the +general purpose memory manager so that they can be reused by other pools or non-pool allocators? The common answer is: usually never. +Why? Simple Segregated Storage erases all bookeeping data from the node and this makes "block" deallocation very expensive: the pool +needs to traverse the free nodes list (nodes are not ordered by their original block) and check the address of each node +against the address of each "block" just to check if all the nodes of a "block" are free and the block can be safely deallocated. If +we have B blocks and N nodes, this is a O(N*B) operation. Usually N (and B) can be quite high and this operation can take minutes +if the application has allocated several tens of thousands of nodes. That's why the option to trim the memory is provided only as +an optional and explicit operation. +
+ ++A manual operation makes allocator changes non-portable (How is that manual trimming activated? How long does it take?) so many +applications just ignore this operation. This also has a drawback: if an application allocates a big amount of nodes in a container +and then destroys the container, the pool will hold a big list of free nodes but that memory won't be reused by any other container +not using a pool with the same node size, leading to excessive memory waste. +
+ ++Can we obtain the small size overhead of simple segregated storage and a fast, continuous, memory trimming operation? +The idea behind adaptive pools is simple: when deallocating a node, an efficient self-trimming node pool needs +a fast access to the "block" containing the node. If so, the pool can check how many free nodes are in that "block" +(the "block" has a header storing the number of free blocks) and if all nodes are free, the pool can return +that block to the general purpose memory manager. +
+ ++The problem is that storing bookeeping data (a pointer to the block containing that node) in a node has +a size overhead. But there is an extremely fast way to obtain a pointer to the block containing the node with zero +size overhead per-node: memory alignment. +
+ ++Suppose a 4KB "block" also aligned to 4KB: a node just needs a mask operation to obtain the address of the block. +Once we have it, we can check the number of free nodes and deallocate the block if all the nodes are free. And once we +have fast access to the "block" we can develop a more complicated trimming scheme that favors the creation of free "blocks" +and improves trimming probabilities. +
+ ++Allocators usually have a payload that store bookeeping information for deallocation. If an adaptive pool allocates 4KB +bytes aligned to 4KB, it will really allocate 4KB + payload bytes where the start of the non-payload section is aligned +to 4KB. This means that the memory adjacent to this aligned block won't be suitable for a new 4KB aligned allocation leading +to excessive memory waste (a 4KB allocation would waste next 4KB - payload bytes of memory). +
+ + +There is a simple solution: allocate (4KB - payload bytes) bytes aligned to 4KB. This simple +solution make contiguous memory usable for a future 4KB aligned allocation leading to optimal memory usage. That's why +knowing the size of the payload is important for aligned memory allocations where the size of the allocation is not fixed. + + + ++Once we have access from a node to its block, we can implement different policies for them. +Once a block is completely free we can just deallocate it but the allocation pattern +could lead to several half-used blocks, increasing in practice the memory usage +we'll trying to minimize. +
+ ++Alternatively, we can implement a strategy in order to minimize the number of in-use +blocks. We can try to allocate nodes always from blocks with less free nodes, +leading, at least in theory, to more fully used blocks and more fully free blocks that +could be deallocated faster. Obviously, this block management must be very fast: searching for the +most used block can ruin performance, specially when comparing with the efficient +simple segregated storage. The same happens when deallocating a node: if block +management needs too many operations deallocation can have very bad performance. +
+ ++Adaptive pools can require excessive alignment requirements when many nodes are allocated in a chunk. +If a block is configured to hold at least 256 nodes and the size of a node is 24 bytes, the required size +for the block is (block payload + 20*256) > 4096 -> 8196 bytes, which also needs 8196 bytes alignment. This +means that increasing the number of nodes a block can hold also increases the required alignment. +
Adaptive pools can require excessive alignment when many nodes are allocated in a block. If a block is configured to hold at least 256 nodes and the size of a node is 24 bytes, the required size for the block is (block header + 24*256) > 4096 -> 8192 bytes, which also needs 8192 byte alignment. This means that increasing the number of nodes a block can hold also increases the required alignment.
+ +adaptive_pool
+adaptive_pool
is built above the aligned memory allocation function
+from our modified DLmalloc library. Alignment must be power of two:
+
+
+void* boost_cont_memalign(size_t bytes, size_t alignment); ++ + +
+adaptive_pool
is implemented as follows:
+
+
This is the declaration of adaptive_pool
:
+
   const std::size_t ADP_nodes_per_block   = unspecified;
   const std::size_t ADP_max_free_blocks   = unspecified;
   const std::size_t ADP_overhead_percent  = unspecified;
   const std::size_t ADP_only_alignment    = unspecified;

   template < class T
            , unsigned Version = 2
            , std::size_t NodesPerBlock   = ADP_nodes_per_block
            , std::size_t MaxFreeBlocks   = ADP_max_free_blocks
            , std::size_t OverheadPercent = ADP_overhead_percent
            >
   class adaptive_pool;
+Description of the template parameters: +
If OverheadPercent is ADP_only_alignment, then no sub-blocks will be used and NodesPerBlock will determine the alignment required by blocks.
+To allocate a new node adaptive_pool
needs to:
+
+To deallocate a new node adaptive_pool
needs to:
+
adaptive_pool
+In this test we'll measure 8 allocators: + +
1. AllocatorPlusV1: ba::allocator<int, 1>, an STL conformant allocator around dlmalloc and dlfree (no burst allocation).
2. AllocatorPlusV2: ba::allocator<int, 2>, a burst-allocation capable allocator.
3. AdPoolAlignOnlyV1: an adaptive pool with no sub-blocks (OverheadPercent = ADP_only_alignment), Version 1 interface.
4. AdPoolAlignOnlyV2: the same adaptive pool with burst allocation enabled.
5. AllocPool2PercentV1: an adaptive pool using sub-blocks, configured with a maximum overhead of 2 percent, Version 1 interface.
6. AllocPool2PercentV2: the same adaptive pool with burst allocation enabled.
7. SimpleSegregatedStorageV1: a classic simple segregated storage pool (blocks are never returned to DLmalloc).
8. SimpleSegregatedStorageV2: the same simple segregated storage pool with burst allocation enabled.

All pool allocators (allocators 3-8) are configured with a minimum of 256 nodes per block. For an int list, on 32 bit systems, the required alignment is 4096 bytes for allocators 3 and 4 (adaptive pools with no sub-blocks) and 2048 bytes for allocators 5 and 6 (adaptive pools with sub-blocks).
The test will be the same as the used in the section Testing burst allocation. +Now let's see the results of the test in a Windows XP machine (AMD 1.47 Ghz) using Visual C++ 2003: +
+ +
+In general, insertion speeds are better for pool allocators than for DLmalloc. Analyzing only V1 allocators
+(no burst allocation) we can see that SimpleSegregatedStorageV1
performs best, followed by
+AdPoolAlignOnlyV1
and AllocPool2PercentV1
. This is quite logical: simple
+segregated storage minimizes all block operations because it does
+not deallocate free blocks on the fly. AdPoolAlignOnlyV1
and AllocPool2PercentV1
+perform quite similarly.
+
+When comparing V2 allocators (burst-allocation enabled) we see that performance of all pool allocators is very
+similar. The reason for this is that the slightly higher costs of adaptive pools are practically minimized when the pool
+allocates several nodes in a row: all needed blocks are preallocated in a single call, error handling in loops can be
+simplified leading to tighter loops... Curiously, the performance of a burst-enabled DLmalloc
+(AllocatorPlusV2
) is on par with pools with burst allocation. This is not a surprise: DLmalloc burst
+allocation uses exactly the same mechanism as pools: allocating big block and create nodes from them. The difference, though,
+is in the overhead per node.
+
+When analyzing deallocations, results for V1 allocators are pretty similar, with SimpleSegregatedStorageV1
+leading the test. We must remember that SimpleSegregatedStorageV1
does not perform any block managing
+and deallocation, so this is the expected behavior. Adaptive pool deallocation times are exactly between DLmalloc
+(AllocatorPlusV1
) and segregated storage times. Adaptive pools perform a quite complicated block management
ordering free blocks by free node count and deallocating completely free blocks when their number exceeds
a configured value (2 blocks in this test). As expected, AdPoolAlignOnlyV2
performs marginally better than
+AllocPool2PercentV2
because deallocating a node allocated with AllocPool2PercentV2
(adaptive
+pool with sub-blocks) needs an additional indirection to obtain the block header from the node (first we reach the
+header of the sub-block and this header stores the address of the block header).
+
When analyzing burst-enabled allocators, however, the picture changes: adaptive pools perform similarly to SimpleSegregatedStorageV2. Again, burst allocation allows the elimination of some redundant checks and management operations, yielding better performance and amortization of block managing costs.
+
+For big ranges, V2 (burst) adaptive pools are even better than simple segregated storage by a small margin. +
+ +
Now let's see the amount of memory allocated from the operating system by DLmalloc with each allocator (V1 and V2 practically allocate the same amount of memory, so we only represent V2 allocators). The DLmalloc wrapper (AllocatorPlusV2) allocates more memory than the rest of the allocators. Allocated nodes are 12 bytes in 32 bit systems, so DLmalloc needs 12 bytes + overhead (4 bytes) = 16 bytes per node. With 2000000 nodes this yields approximately 32 MB. For pools, since the overhead is practically nonexistent, the total memory is around 24 MB (12 bytes per node).
+
+In the previous graph we can see the overhead (in percentage) for each allocator. As expected, DLmalloc
+offers a 33% overhead (4 bytes per each 12 byte node) whereas pools offer less than 2% overhead (remember that
+we've configured AllocPool2PercentVX
with a maximum of 2 percent overhead). As expected,
+SimpleSegregatedStorageVX
offers the minimum overhead but adaptive pools (with their additional
+headers for block management) are also very size-efficient.
+
The previous picture shows how much in-use memory DLmalloc reports after deallocating all the nodes. Note that this does not count the OS memory DLmalloc still caches, but memory that has been allocated through DLmalloc and has not yet been deallocated through DLmalloc. In other words, memory still held by the pools.
+ +
+As expected DLmalloc wrappers (AllocatorPlusVX
) have deallocated all their memory (no leaks!).
+Also, simple segregated storage allocators have not deallocated a single block so they still hold all the memory.
+Adaptive pools (AdPoolAlignOnlyVX
and AllocPool2PercentVX
) only hold the number of blocks
+we've configured: 2 blocks x 4096 bytes per block = 8192 bytes. So we clearly show that adaptive pools
+not only offer a low memory overhead and fast allocation times, but they also return memory to the underlying
+memory allocator!
+
+Once allocators return memory to DLmalloc, DLmalloc can trim its cached OS memory and return +that memory to the OS. The following graph shows the amount of memory DLmalloc caches after all +nodes have been deallocated: +
+ ++Cached memory usually depends on the number of contiguous free pages that DLmalloc holds +internally so numbers are not important. The only thing that this graph shows is that adaptive pools, +after returning memory to DLmalloc, encourage also memory trimming so that DLmalloc can +return memory to the operating system. +
+ ++Now we'll execute the same test in a Linux (Suse 10.1) virtual machine (VMWare 6) using gcc-4.1: +
+ ++Insertion speeds for V1 allocators are more or less the expected with segregated storage leading the group +and non-pool DLmalloc wrapper being the slowest one. V2 (burst) allocators are faster with similar +performance between adaptive pools and simple segregated storage as the number of the insertion range grows. +
+ +
Erasure speeds are similar to Windows, with SimpleSegregatedStorageVX leading the test and adaptive pools coming next.
+