algorithm/string/doc/design.xml

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE library PUBLIC "-//Boost//DTD BoostBook XML V1.0//EN"
"http://www.boost.org/tools/boostbook/dtd/boostbook.dtd">
<section id="string_algo.design" last-revision="$Date$">
    <title>Design Topics</title>

    <using-namespace name="boost"/>
    <using-namespace name="boost::algorithm"/>

    <section id="string_algo.string">
        <title>String Representation</title>

        <para>
            As the name suggest, this library works mainly with strings. However, in the context of this library,
            a string is not restricted to any particular implementation (like <code>std::basic_string</code>),
            rather it is a concept. This allows the algorithms in this library to be reused for any string type,
            that satisfies the given requirements.
        </para>
        <para>
            <emphasis role="bold">Definition:</emphasis> A string is a 
            <ulink url="../../libs/utility/Collection.html">collection</ulink> of characters accessible in sequential
            ordered fashion. Character is any value type with "cheap" copying and assignment.                
        </para>
        <para>
            First requirement of string-type is that it must accessible using 
            <link linkend="string_algo.collection_traits">collection traits</link>. This facility allows to access
            the elements inside the string in a uniform iterator-based fashion. 
            This facility actually requires lessen requirements then collection concept. It implements 
            <ulink url="../../libs/algorithm/string/doc/external_concepts.html">external</ulink> collection interface.
            This is sufficient for our library
        </para>
        <para>            
            Second requirement defines the way in which are the characters stored in the string. Algorithms in 
            this library work with an assumption, that copying a character is cheaper then allocating an extra 
            storage to cache results. This is natural assumption for common character types. Algorithms will 
            work even if this requirement will not be satisfied, however at the cost of performance degradation.
        <para>
        </para>
            In addition some algorithms have additional requirements on the string-type. Particularly, it is required,
            that an algorithm can create a new string of the given type. In this case, it is required, that
            the type satisfies the sequence (Std &sect;23.1.1) requirements.
        </para>
        <para>
            In the reference and also in the code, requirement on the string type is designated by the name of
            template argument. <code>CollectionT</code> means that the basic collection requirements must be held.
            <code>SequenceT</code> designates extended sequence requirements.
        </para>
    </section>
    
    
    <section id="string_algo.iterator_range">
        <title><code>iterator_range</code> class</title>

        <para>
            An <classname>iterator_range</classname> is an encapsulation of a pair of iterators that
            delimit a sequence (or, a range). This concept is widely used by 
            sequence manipulating algorithms. Although being so useful, there no direct support 
            for it in the standard library (The closest thing is that some algorithms return a pair of iterators). 
            Instead all STL algorithms have two distinct parameters for beginning and end of a range. This design 
            is natural for implementation of generic algorithms, but it forbids to work with a range as a single value. 
        </para> 
        <para>
            It is possible to encapsulate a range in <code>std::pair&lt;&gt;</code>, but
            the <code>std::pair&lt;&gt;</code> is a too generic encapsulation, so it is not best match for a range.
            For instance, it does not enforce that begin and end iterators are of the same type.
        </para>
        <para>
            Naturally the range concept is heavily used also in this library. During the development of
            the library, it was discovered, that there is a need for a reasonable encapsulation for it.
            A core part of the library deals with substring searching algorithms. Any such an algorithm,
            returns a range delimiting the result of the search. <code>std::pair&lt;&gt;</code> was considered as 
            unsuitable. Therefore the <code>iterator_range</code> was defined.
        </para>
        <para>
            The intention of the <code>iterator_range</code> class is to manage a range as a single value and provide 
            a basic interface for common operations. Its interface is similar to that of collection. 
            In addition of <code>begin()</code>
            and <code>end()</code> accessors, it has member functions for checking if the range is empty,
            or to determine the size of the range. It has also a set of member typedefs that extract
            type information from the encapsulated iterators. As such, the interface is compatible with 
            the <link linkend="string_algo.collection_traits">collection traits</link> requirements so
            it is possible to use this class as a parameter to many algorithms in this library.
        </para>
        <para>
            <classname>iterator_range</classname> will be moved to Boost.Range library in the future
            releases. Internal version will be deprecated then.
        </para>
    </section>
        
    <section id="string_algo.collection_traits">
        <title>Collection Traits</title>

        <para>
            Collection traits provide uniform access to different types of 
            <ulink url="../../libs/utility/Collection.html">collections</ulink> . 
            This functionality allows to write generic algorithms which work with several 
            different kinds of collections. For this library it means, that, for instance,
            many algorithms work with <code>std::string</code> as well as with <code>char[]</code>.
            This facility implements
            <ulink url="../../libs/algorithm/string/doc/external_concepts.html">external</ulink> collection
            concept.
        </para>
        <para>
            The following collection types are supported:
            <itemizedlist>
                <listitem>
                    Standard containers
                </listitem>
                <listitem>
                    Built-in arrays (like int[])
                </listitem>
                <listitem>
                    Null terminated strings (this includes char[],wchar_t[],char*, and wchar_t*)
                </listitem>
                <listitem>
                    std::pair&lt;iterator,iterator&gt;
                </listitem>
            </itemizedlist>
        </para>
        <para>
            Collection traits support a subset of container concept (Std &sect;23.1). This subset 
            can be described as an input container concept, e.g. a container with an immutable content. 
            Its definition can be found in the header <headername>boost/algorithm/string/collection_traits.hpp</headername>.
        </para>
        <para>
            In the table C denotes a container and c is an object of C. 
        </para>
        <table>
            <title>Collection Traits</title>
            <tgroup cols="3" align="left">
                <thead>
                    <row>   
                        <entry>Name</entry>
                        <entry>Standard collection equivalent</entry>
                        <entry>Description</entry>
                    </row>Maeterlinck
                </thead>
                <tbody>
                    <row>
                        <entry><classname>value_type_of&lt;C&gt;</classname>::type</entry>
                        <entry><code>C::value_type</code></entry>
                        <entry>Type of contained values</entry>
                    </row>
                    <row>
                        <entry><classname>difference_type_of&lt;C&gt;</classname>::type</entry>
                        <entry><code>C::difference_type</code></entry>
                        <entry>difference type of the collection</entry>
                    </row>
                    <row>
                        <entry><classname>iterator_of&lt;C&gt;</classname>::type</entry>
                        <entry><code>C::iterator</code></entry>
                        <entry>iterator type of the collection</entry>
                    </row>
                    <row>
                        <entry><classname>const_iterator_of&lt;C&gt;</classname>::type</entry>
                        <entry><code>C::const_iterator</code></entry>
                        <entry>const_iterator type of the collection</entry>
                    </row>
                    <row>
                        <entry><classname>result_iterator_of&lt;C&gt;</classname>::type</entry>
                        <entry></entry>
                        <entry>
                            result_iterator type of the collection. This type maps to <code>C::iterator</code>
                            for mutable collection and <code>C::const_iterator</code> for const collection.
                        </entry>
                    </row>
                    <row>
                        <entry><functionname>begin(c)</functionname></entry>
                        <entry><code>c.begin()</code></entry>
                        <entry>
                            Gets the iterator pointing to the start of the collection.
                        </entry>
                    </row>
                    <row>
                        <entry><functionname>end(c)</functionname></entry>
                        <entry><code>c.end()</code></entry>
                        <entry>
                            Gets the iterator pointing to the end of the collection.
                        </entry>
                    </row>
                    <row>
                        <entry><functionname>size(c)</functionname></entry>
                        <entry><code>c.size()</code></entry>
                        <entry>
                            Gets the size of the collection.
                        </entry>
                    </row>
                    <row>
                        <entry><functionname>empty(c)</functionname></entry>
                        <entry><code>c.empty()</code></entry>
                        <entry>
                            Checks if the collection is empty.
                        </entry>
                    </row>
                </tbody>
            </tgroup>
        </table>

        <para>
            The collection traits are only a temporary part of this library. They will be replaced in the future
            releases by Boost.Range library. Use of the internal implementation will be deprecated then.
        </para>
    
    </section>
    <section id="string_algo.sequence_traits">
        <title>Sequence Traits</title>

        <para>
            Major difference between <code>std::list</code> and <code>std::vector</code> is not in the interfaces
            they provide, rather in the inner details of the class and the way how it performs 
            various operation. The problem is that it is not possible to infer this difference from the 
            definitions of classes without some special mechanism.
            However some algorithms can run significantly faster with the knowledge of the properties
            of a particular container.
        </para>
        <para>
            Sequence traits allows one to specify additional properties of a sequence container (see Std.&sect;32.2).
            These properties are then used by algorithms to select optimized handling for some operations.
            The sequence traits are declared in the header 
            <headername>boost/algorithm/string/sequence_traits.hpp</headername>.
        </para>

        <para>
            In the table C denotes a container and c is an object of C.
        </para>
        <table>
            <title>Sequence Traits</title>
            <tgroup cols="2" align="left">
                <thead>
                    <row>   
                        <entry>Trait</entry>
                        <entry>Description</entry>
                    </row>
                </thead>
                <tbody>
                    <row>
                        <entry><classname>has_native_replace&lt;C&gt;</classname>::value</entry>
                        <entry>Specifies that the sequence has std::string like replace method</entry>
                    </row>
                    <row>
                        <entry><classname>has_stable_iterators&lt;C&gt;</classname>::value</entry>
                        <entry>
                            Specifies that the sequence has stable iterators. It means,
                            that operations like <code>insert</code>/<code>erase</code>/<code>replace</code> 
                            do not invalidate iterators.
                        </entry>
                    </row>
                    <row>
                        <entry><classname>has_const_time_insert&lt;C&gt;</classname>::value</entry>
                        <entry>
                            Specifies that the insert method of the sequence has 
                            constant time complexity.
                        </entry>
                    </row>
                    <row>
                        <entry><classname>has_const_time_erase&lt;C&gt;</classname>::value</entry>
                        <entry>
                            Specifies that the erase method of the sequence has constant time complexity
                        </entry>
                    </row>
                    </tbody>
            </tgroup>
        </table>
        
        <para>
            Current implementation contains specializations for std::list&lt;T&gt; and
            std::basic_string&lt;T&gt; from the standard library and SGI's std::rope&lt;T&gt; and std::slist&lt;T&gt;.
        </para>
    </section>
    <section id="string_algo.find">
        <title>Find Algorithms</title>

        <para>
            Find algorithms have similar functionality to <code>std::search()</code> algorithm. They provide a different 
            interface which is more suitable for common string operations. 
            Instead of returning just the start of matching subsequence they return a range which is necessary 
            when the length of the matching subsequence is not known beforehand. 
            This feature also allows a partitioning of  the input sequence into three 
            parts: a prefix, a substring and a suffix. 
        </para>
        <para>
            Another difference is an addition of various searching methods besides find_first, including find_regex. 
        </para>
        <para>
            It the library, find algorithms are implemented in terms of 
            <link linkend="string_algo.finder_concept">Finders</link>. Finders are used also by other facilities 
            (replace,split).
            For convenience, there are also function wrappers for these finders to simplify find operations.
        </para>
        <para>
            Currently the library contains only naive implementation of find algorithms with complexity 
            O(n * m) where n is the size of the input sequence and m is the size of the search sequence. 
            There are algorithms with complexity O(n), but for smaller sequence a constant overhead is 
            rather big. For small m &lt;&lt; n (m by magnitude smaller than n) the current implementation 
            provides acceptable efficiency. 
            Even the C++ standard defines the required complexity for search algorithm as O(n * m). 
            It is possible that a future version of library will also contain algorithms with linear 
            complexity as an option
        </para>
    </section>
    <section id="string_algo.replace">
        <title>Replace Algorithms</title>

        <para>
            The implementation of replace algorithms follows the layered structure of the library. The 
            lower layer implements generic substitution of a range in the input sequence. 
            This layer takes a <link linkend="string_algo.finder_concept">Finder</link> object and a 
            <link linkend="string_algo.formatter_concept">Formatter</link> object as an input. These two 
            functors define what to replace and what to replace it with. The upper layer functions 
            are just wrapping calls to the lower layer. Finders are shared with the find and split facility. 
        </para>
        <para>
            As usual, the implementation of the lower layer is designed to work with a generic sequence while
            taking an advantage of specific features if possible 
            (by using <link linkend="string_algo.sequence_traits">Sequence traits</link>)
        </para>         
    </section>
    <section id="string_algo.split">
        <title>Find Iterators &amp; Split Algorithms</title>

        <para>
               Find iterators are a logical extension of <link linkend="string_algo.find">find facility</link>.
            Instead of searching for one match, the whole input can be iteratively searched for multiple matches.
            The result of the search is then used to partition the input. It depends on the algorithms which parts 
            are returned as the result. It can be the matching parts (<classname>find_iterator</classname>) of the parts in
            between (<classname>split_iterator</classname>). 
        </para>
        <para>
            In addition the split algorithms like <functionname>find_all()</functionname> and <functionname>split()</functionname>
            can simplify the common operations. They use a find iterator to search the whole input and copy the 
            matches they found into the supplied container.
        </para>
    </section>
    <section id="string_algo.exception">
        <title>Exception Safety</title>

        <para>
            The library provides some exceptions safety guaranties under following assumptions:
		 	<orderedlist numeration="arabic">
                <listitem>
                    <para>        
                        All types that are used as a template arguments or passed as arguments to the 
                        facilities in this library provide <emphasis>basic exception guarantee</emphasis>.
                    </para>
                </listitem>
                <listitem>
                    <para>
                        If the types mentioned in the first assumption can provide 
                        <emphasis>strong exception guarantee</emphasis> for their const operations, some algorithm
                        can provide stronger guaranties.
                    </para>
                </listitem>
            </orderedlist>
        </para>        
        <para>
            Unless stated otherwise, all facilities and algorithms in this library have <emphasis>basic exception guarantee</emphasis>.
        </para>
    </section>
</section>