<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Boost.Regex: Introduction</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" type="text/css" href="../../../boost.css">
</head>
<body>
<P>
<TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
<TR>
<td valign="top" width="300">
<h3><a href="../../../index.htm"><img height="86" width="277" alt="C++ Boost" src="../../../c++boost.gif" border="0"></a></h3>
</td>
<TD width="353">
<H1 align="center">Boost.Regex</H1>
<H2 align="center">Introduction</H2>
</TD>
<td width="50">
<h3><a href="index.html"><img height="45" width="43" alt="Boost.Regex Index" src="uarrow.gif" border="0"></a></h3>
</td>
</TR>
</TABLE>
</P>
<HR>
<p></p>
<P>Regular expressions are a form of pattern matching that is often used in text
processing; many users will be familiar with the Unix utilities <I>grep</I>, <I>sed</I>
and <I>awk</I>, and the programming language <I>Perl</I>, each of which makes
extensive use of regular expressions. Traditionally, C++ users have been limited
to the POSIX C APIs for manipulating regular expressions, and while regex++
does provide these APIs, they do not represent the best way to use the
library. For example, regex++ can cope with wide character strings and with search
and replace operations (in a manner analogous to either sed or Perl), something
that traditional C libraries cannot do.</P>
<P>The class <A href="basic_regex.html">boost::basic_regex</A> is the key class in
this library; it represents a "machine readable" regular expression, and is
very closely modeled on std::basic_string: think of it as a string plus the
actual state machine required by the regular expression algorithms. As with
std::basic_string, there are two typedefs that are almost always the means by
which this class is referenced:</P>
<pre><B>namespace </B>boost{

<B>template</B> &lt;<B>class</B> charT,
          <B>class</B> traits = regex_traits&lt;charT&gt;,
          <B>class</B> Allocator = std::allocator&lt;charT&gt; &gt;
<B>class</B> basic_regex;

<B>typedef</B> basic_regex&lt;<B>char</B>&gt; regex;
<B>typedef</B> basic_regex&lt;<B>wchar_t</B>&gt; wregex;

}</pre>
<P>To see how this library can be used, imagine that we are writing a credit card
processing application. Credit card numbers generally come as a string of
16 digits, split into groups of 4 digits that are separated by either a space
or a hyphen. Before storing a credit card number in a database (not necessarily
something your customers will appreciate!), we may want to verify that the
number is in the correct format. To match any digit we could use the regular
expression [0-9]; however, ranges of characters like this are actually locale
dependent. Instead we should use the POSIX standard form [[:digit:]], or the
regex++ and Perl shorthand for this, \d (note that many older libraries tended
to be hard-coded to the C locale, so this was not an issue for them).
That leaves us with the following regular expression to validate credit card
number formats:</P>
<PRE>(\d{4}[- ]){3}\d{4}</PRE>
<P>Here the parentheses act to group (and mark for future reference)
sub-expressions, the character set [- ] matches either a hyphen or a space,
and the {4} means "repeat exactly 4 times". This is an example
of the extended regular expression syntax used by Perl, awk and egrep. Regex++
also supports the older "basic" syntax used by sed and grep, but this is
generally less useful unless you already have some basic regular expressions
that you need to reuse.</P>
<P>Now let's take that expression and place it in some C++ code to validate the
format of a credit card number:</P>
<PRE><B>bool</B> validate_card_format(<B>const</B> std::string&amp; s)
{
   <I>// the expression is compiled once, on first use:</I>
   <B>static</B> <B>const</B> <A href="basic_regex.html">boost::regex</A> e("(\\d{4}[- ]){3}\\d{4}");
   <B>return</B> <A href="regex_match.html">regex_match</A>(s, e);
}</PRE>
<P>Note how we had to add some extra escapes to the expression: remember that the
escape is seen once by the C++ compiler before it gets to be seen by the
regular expression engine; consequently, escapes in regular expressions have to
be doubled up when embedding them in C/C++ code. Also note that all the
examples assume that your compiler supports Koenig (argument-dependent) lookup;
if yours doesn't (for example VC6), then you will have to add boost:: prefixes to some of
the function calls in the examples.</P>
<P>Those of you who are familiar with credit card processing will have realized
that while the format used above is suitable for human readable card numbers,
it does not represent the format required by online credit card systems; these
require the number as a string of 16 (or possibly 15) digits, without any
intervening spaces. What we need is a means to convert easily between the two
formats, and this is where search and replace comes in. Those who are familiar
with the utilities <I>sed</I> and <I>Perl</I> will already be ahead here; we
need two strings: one a regular expression, the other a "<A href="format_syntax.html">format
string</A>" that provides a description of the text to replace the match
with. In regex++ this search and replace operation is performed with the
algorithm <A href="regex_replace.html">regex_replace</A>; for our credit card
example we can write two algorithms like this to provide the format
conversions:</P>
<PRE><I>// match any format with the regular expression:
</I><B>const</B> boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
<B>const</B> std::string machine_format("\\1\\2\\3\\4");
<B>const</B> std::string human_format("\\1-\\2-\\3-\\4");

std::string machine_readable_card_number(<B>const</B> std::string&amp; s)
{
   <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, machine_format, boost::match_default | boost::format_sed);
}

std::string human_readable_card_number(<B>const</B> std::string&amp; s)
{
   <B>return</B> <A href="regex_replace.html">regex_replace</A>(s, e, human_format, boost::match_default | boost::format_sed);
}</PRE>
<P>Here we've used marked sub-expressions in the regular expression to split out
the four parts of the card number as separate fields; the format string then
uses the sed-like syntax (\1 to \4 refer to the first to fourth marked
sub-expression) to replace the matched text with the reformatted
version.</P>
<P>In the examples above, we haven't directly manipulated the results of a regular
expression match; in general, however, the result of a match contains a number of
sub-expression matches in addition to the overall match. When the library needs
to report a regular expression match, it does so using an instance of the class
<A href="match_results.html">match_results</A>; as before, there are typedefs of this class for the most
common cases:
</P>
<PRE><B>namespace </B>boost{
<B>typedef</B> match_results&lt;<B>const</B> <B>char</B>*&gt; cmatch;
<B>typedef</B> match_results&lt;<B>const</B> <B>wchar_t</B>*&gt; wcmatch;
<STRONG>typedef</STRONG> match_results&lt;std::string::const_iterator&gt; smatch;
<STRONG>typedef</STRONG> match_results&lt;std::wstring::const_iterator&gt; wsmatch;
}</PRE>
<P>The algorithms <A href="regex_search.html">regex_search</A> and <A href="regex_match.html">regex_match</A>
make use of match_results to report what matched; the difference between these
algorithms is that <A href="regex_match.html">regex_match</A> will only find
matches that consume <EM>all</EM> of the input text, whereas
<A href="regex_search.html">regex_search</A> will <EM>search</EM> for a match anywhere within the text
being matched.</P>
<P>Note that these algorithms are not restricted to searching regular C-strings:
any bidirectional iterator type can be searched, allowing for the possibility
of seamlessly searching almost any kind of data.
</P>
<P>For search and replace operations, in addition to the algorithm
<A href="regex_replace.html">regex_replace</A> that we have already seen, the <A href="match_results.html">match_results</A>
class has a format member that takes the result of a match and a format string,
and produces a new string by merging the two.</P>
<P>For iterating through all occurrences of an expression within a text, there are
two iterator types: <A href="regex_iterator.html">regex_iterator</A> will
enumerate over the <A href="match_results.html">match_results</A> objects
found, while <A href="regex_token_iterator.html">regex_token_iterator</A> will
enumerate a series of strings (similar to Perl-style split operations).</P>
<P>For those who dislike templates, there is a high-level wrapper class RegEx
that encapsulates the lower-level template code. It provides a
simplified interface for those who don't need the full power of the library,
and supports only narrow characters and the "extended" regular expression
syntax. This class is now deprecated, as it does not form part of the regular
expressions C++ standard library proposal.
</P>
<P>The <A href="posix_api.html">POSIX API</A> functions regcomp, regexec, regfree
and regerror are available in both narrow character and Unicode versions, and
are provided for those who need compatibility with these APIs.
</P>
<P>Finally, note that the library now has run-time <A href="localisation.html">localization</A>
support, and recognizes the full POSIX regular expression syntax, including
advanced features like multi-character collating elements and equivalence
classes. It also provides compatibility with other regular expression
libraries, including the GNU and BSD4 regex packages and, to a more limited extent,
Perl 5.
</P>
<P>
<HR>
<P></P>
<p>Revised
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->
24 Oct 2003
<!--webbot bot="Timestamp" endspan i-checksum="39359" --></p>
<p><i>&copy; Copyright John Maddock 1998-
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%Y" startspan -->
2003<!--webbot bot="Timestamp" endspan i-checksum="39359" --></i></p>
<P><I>Use, modification and distribution are subject to the Boost Software License,
Version 1.0. (See accompanying file <A href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</A>
or copy at <A href="http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</A>)</I></P>
</body>
</html>