Regular expressions are a form of pattern matching often used in text
processing; many users will be familiar with the Unix utilities grep, sed and
awk, and the programming language Perl, each of which makes extensive use of
regular expressions. Traditionally C++ users have been limited to the POSIX
C APIs for manipulating regular expressions, and while Boost.Regex does provide
these APIs, they do not represent the best way to use the library. For example,
Boost.Regex can cope with wide character strings, and with search and replace operations
(in a manner analogous to either sed or Perl), neither of which traditional C
libraries can do.
</p>
<p>
The class <a href="ref/basic_regex.html" title="basic_regex"><code class="computeroutput"><span class="identifier">basic_regex</span></code></a>
is the key class in this library; it represents a "machine readable"
regular expression, and is very closely modeled on <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">basic_string</span></code>;
think of it as a string plus the actual state-machine required by the regular
expression algorithms. Like <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">basic_string</span></code>,
there are two typedefs that are almost always the means by which this class
Note how we had to add some extra escapes to the expression: remember that
the escape is seen once by the C++ compiler, before it gets to be seen by the
regular expression engine, consequently escapes in regular expressions have
to be doubled up when embedding them in C/C++ code. Also note that all the
examples assume that your compiler supports argument-dependent lookup;
if yours doesn't (for example VC6), then you will have to add some <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span></code> prefixes
to some of the function calls in the examples.
</p>
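<p>
The doubling of escapes described above can be sketched as follows. This is
a minimal illustration rather than code from the library: the helper names
are hypothetical, and it uses the standard library's
<code class="computeroutput">std::regex</code> (which shares its algorithm names with Boost.Regex) so that
it builds without Boost installed.
</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `s` consists of exactly four digits.
// The regex engine must receive \d{4}; inside an ordinary C++ string
// literal the backslash has to be escaped itself, hence "\\d{4}".
bool is_four_digits(const std::string& s)
{
    static const std::regex escaped("\\d{4}");   // engine sees: \d{4}
    return std::regex_match(s, escaped);
}

// The same expression written as a C++11 raw string literal,
// where the compiler leaves the backslash alone.
bool is_four_digits_raw(const std::string& s)
{
    static const std::regex raw(R"(\d{4})");     // engine sees: \d{4}
    return std::regex_match(s, raw);
}
```

<p>
Both functions compile the same expression; only the spelling in the source
file differs.
</p>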
<p>
Those of you who are familiar with credit card processing will have realized
that while the format used above is suitable for human-readable card numbers,
it does not represent the format required by online credit card systems; these
require the number as a string of 16 (or possibly 15) digits, without any intervening
spaces. What we need is a means to convert easily between the two formats,
and this is where search and replace comes in. Those who are familiar with
the utilities sed and Perl will already be ahead here; we need two strings
- one a regular expression - the other a "format string" that provides
a description of the text to replace the match with. In Boost.Regex this search
and replace operation is performed with the algorithm <a href="ref/regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a>; for our credit card
example we can write two algorithms like this to provide the format conversions:
</p>
<pre class="programlisting">
<span class="comment">// match any format with the regular expression:</span>
</pre>
<p>
The algorithms <a href="ref/regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a>
and <a href="ref/regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>
make use of <a href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a>
to report what matched; the difference between these algorithms is that <a href="ref/regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>
will only find matches that consume <span class="emphasis"><em>all</em></span> of the input text,
whereas <a href="ref/regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a>
will search for a match anywhere within the text being matched.
</p>
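<p>
The difference between whole-input matching and searching can be sketched
like this. The helper names are hypothetical, and the sketch is written
against <code class="computeroutput">std::regex</code> so that it compiles without Boost; the Boost.Regex
algorithms carry the same names.
</p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the *entire* text is a run of digits.
bool whole_text_is_number(const std::string& text)
{
    static const std::regex digits("\\d+");
    return std::regex_match(text, digits);
}

// Hypothetical helper: returns the first run of digits found anywhere
// in the text, or an empty string when there is none.
std::string first_number(const std::string& text)
{
    static const std::regex digits("\\d+");
    std::smatch what;                       // match_results for std::string
    if (std::regex_search(text, what, digits))
        return what[0].str();               // sub-match 0 is the whole match
    return std::string();
}
```

<p>
With the input <code class="computeroutput">"year 2024"</code> the first helper reports no match, because
the letters are not consumed by the expression, while the second locates
the digits inside the text.
</p>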
<p>
Note that these algorithms are not restricted to searching regular C-strings;
any bidirectional iterator type can be searched, allowing for the possibility
of seamlessly searching almost any kind of data.
</p>
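<p>
As an illustration of searching something other than a C-string, the sketch
below runs a search over a <code class="computeroutput">std::list&lt;char&gt;</code>, whose iterators are
bidirectional but not random access. The helper name is hypothetical, and
<code class="computeroutput">std::regex</code> stands in for Boost.Regex so that the code builds without
Boost.
</p>

```cpp
#include <list>
#include <regex>
#include <string>

// Hypothetical helper: returns the first run of digits stored in a
// linked list of characters, or an empty string when there is none.
std::string first_digits(const std::list<char>& text)
{
    typedef std::list<char>::const_iterator iterator;
    static const std::regex digits("[0-9]+");

    // match_results is parameterised on the iterator type being searched.
    std::match_results<iterator> what;
    if (std::regex_search(text.begin(), text.end(), what, digits))
        return what[0].str();
    return std::string();
}
```

<p>
The only change from the string case is the iterator type given to
<code class="computeroutput">match_results</code>.
</p>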
<p>
For search and replace operations, in addition to the algorithm <a href="ref/regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a> that we have already
seen, the <a href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a>
class has a <code class="computeroutput"><span class="identifier">format</span></code> member that
takes the result of a match and a format string, and produces a new string
by merging the two.
</p>
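<p>
Both routes to search and replace can be sketched with the credit card
formats discussed earlier. This is an assumption-laden illustration, not
the library's own example: the expression, the format strings and the
function names are hypothetical, and the code uses <code class="computeroutput">std::regex</code> with its
default <code class="computeroutput">$N</code> format syntax so that it compiles without Boost.
</p>

```cpp
#include <regex>
#include <string>

// Assumed card pattern: four groups of digits (the first may have only
// three), optionally separated by a single space or hyphen.
static const std::regex card("(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})");

// regex_replace: rewrite every match using a format string.
std::string machine_readable(const std::string& s)
{
    return std::regex_replace(s, card, "$1$2$3$4");
}

std::string human_readable(const std::string& s)
{
    return std::regex_replace(s, card, "$1-$2-$3-$4");
}

// match_results::format: merge one match with a format string.
// Returns an empty string when no card number is found.
std::string last_four(const std::string& s)
{
    std::smatch what;
    if (!std::regex_search(s, what, card))
        return std::string();
    return what.format("last four digits: $4");
}
```

<p>
<code class="computeroutput">regex_replace</code> copies unmatched text through unchanged, whereas
<code class="computeroutput">format</code> produces only the text described by the format string.
</p>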
<p>
For iterating through all occurrences of an expression within a text, there
are two iterator types: <a href="ref/regex_iterator.html" title="regex_iterator"><code class="computeroutput"><span class="identifier">regex_iterator</span></code></a> will enumerate over
the <a href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a>
objects found, while <a href="ref/regex_token_iterator.html" title="regex_token_iterator"><code class="computeroutput"><span class="identifier">regex_token_iterator</span></code></a> will enumerate
a series of strings (similar to Perl-style split operations).
</p>
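<p>
The two iterator types can be sketched as follows. The function names are
hypothetical, and the code uses the <code class="computeroutput">std::</code> iterator types, which have
the same names as their Boost.Regex counterparts, so that it builds without
Boost.
</p>

```cpp
#include <regex>
#include <string>
#include <vector>

// regex_iterator: visit every match; each dereference yields a
// match_results object.
std::vector<std::string> all_words(const std::string& text)
{
    static const std::regex word("\\w+");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), word), end;
         it != end; ++it)
        out.push_back(it->str());           // text of the whole match
    return out;
}

// regex_token_iterator: with sub-match index -1 it yields the text
// *between* matches, which gives a split operation.
std::vector<std::string> split_on_commas(const std::string& text)
{
    static const std::regex comma("\\s*,\\s*");
    std::vector<std::string> out;
    for (std::sregex_token_iterator it(text.begin(), text.end(), comma, -1), end;
         it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

<p>
A default-constructed iterator of either type serves as the end-of-sequence
marker, which is why <code class="computeroutput">end</code> is declared without arguments.
</p>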
<p>
For those who dislike templates, there is a high level wrapper class <a href="ref/deprecated_interfaces/old_regex.html" title="High Level Class RegEx (Deprecated)"><code class="computeroutput"><span class="identifier">RegEx</span></code></a>
that is an encapsulation of the lower level template code - it provides a simplified
interface for those who don't need the full power of the library, and supports
only narrow characters and the "extended" regular expression syntax.
This class is now deprecated as it does not form part of the regular expressions
C++ standard library proposal.
</p>
<p>
The POSIX API functions <a href="ref/posix.html#boost_regex.ref.posix.regcomp"><code class="computeroutput"><span class="identifier">regcomp</span></code></a>, <a href="ref/posix.html#boost_regex.ref.posix.regexec"><code class="computeroutput"><span class="identifier">regexec</span></code></a>, <a href="ref/posix.html#boost_regex.ref.posix.regfree"><code class="computeroutput"><span class="identifier">regfree</span></code></a> and <code class="computeroutput"><span class="identifier">regerror</span></code> are available
in both narrow character and Unicode versions, and are provided for those who
need compatibility with these APIs.
</p>
<p>
Finally, note that the library now has <a href="background_information/locale.html" title="Localization">run-time
localization support</a>, and recognizes the full POSIX regular expression
syntax - including advanced features like multi-character collating elements
and equivalence classes - as well as providing compatibility with other regular
expression libraries including GNU and BSD4 regex packages, PCRE and Perl 5.