Files
boost_regex/doc/Attic/captures.html
2003-12-13 12:28:48 +00:00

251 lines
10 KiB
HTML
Raw Blame History

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Boost.Regex: Index</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" type="text/css" href="../../../boost.css">
</head>
<body>
<P>
<TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
<TR>
<td valign="top" width="300">
<h3><a href="../../../index.htm"><img height="86" width="277" alt="C++ Boost" src="../../../c++boost.gif" border="0"></a></h3>
</td>
<TD width="353">
<H1 align="center">Boost.Regex</H1>
<H2 align="center">Understanding Captures</H2>
</TD>
<td width="50">
<h3><a href="index.html"><img height="45" width="43" alt="Boost.Regex Index" src="uarrow.gif" border="0"></a></h3>
</td>
</TR>
</TABLE>
</P>
<HR>
<p></p>
<H2>Marked sub-expressions</H2>
<P>Every time a Perl regular expression contains a parenthesis group (), it spits
out an extra field, known as a marked sub-expression, for example the
expression:</P>
<PRE>(\w+)\W+(\w+)</PRE>
<P>
Has two marked sub-expressions (known as $1 and $2 respectively), in addition
the complete match is known as $&amp;, everything before the first match as $`,
and everything after the match as $'.&nbsp; So if the above expression is
searched for within "@abc def--", then we obtain:</P>
<BLOCKQUOTE dir="ltr" style="MARGIN-RIGHT: 0px">
<P>
<TABLE id="Table2" cellSpacing="1" cellPadding="1" width="300" border="0">
<TR>
<TD>
<P dir="ltr" style="MARGIN-RIGHT: 0px">$`</P>
</TD>
<TD>"@"</TD>
</TR>
<TR>
<TD>$&amp;</TD>
<TD>"abc def"</TD>
</TR>
<TR>
<TD>$1</TD>
<TD>"abc"</TD>
</TR>
<TR>
<TD>$2</TD>
<TD>"def"</TD>
</TR>
<TR>
<TD>$'</TD>
<TD>"--"</TD>
</TR>
</TABLE>
</P>
</BLOCKQUOTE>
<P>In Boost.regex all these are accessible via the <A href="match_results.html">match_results</A>
class that gets filled in when calling one of the matching algorithms (<A href="regex_search.html">regex_search</A>,
<A href="regex_match.html">regex_match</A>, or <A href="regex_iterator.html">regex_iterator</A>).&nbsp;
So given:</P>
<PRE>boost::match_results&lt;IteratorType&gt; m;</PRE>
<P>The Perl and Boost.Regex equivalents are as follows:</P>
<BLOCKQUOTE dir="ltr" style="MARGIN-RIGHT: 0px">
<P>
<TABLE id="Table3" cellSpacing="1" cellPadding="1" width="300" border="0">
<TR>
<TD><STRONG>Perl</STRONG></TD>
<TD><STRONG>Boost.Regex</STRONG></TD>
</TR>
<TR>
<TD>$`</TD>
<TD>m.prefix()</TD>
</TR>
<TR>
<TD>$&amp;</TD>
<TD>m[0]</TD>
</TR>
<TR>
<TD>$n</TD>
<TD>m[n]</TD>
</TR>
<TR>
<TD>$'</TD>
<TD>m.suffix()</TD>
</TR>
</TABLE>
</P>
</BLOCKQUOTE>
<P>
<P>In Boost.Regex each sub-expression match is represented by a <A href="sub_match.html">
sub_match</A> object, this is basically just a pair of iterators denoting
the start and end possition of the sub-expression match, but there are some
additional operators provided so that objects of type sub_match behave a lot
like a std::basic_string: for example they are implicitly <A href="sub_match.html#m3">
convertible to a basic_string</A>, they can be <A href="sub_match.html#o21">compared
to a string</A>, <A href="sub_match.html#o81">added to a string</A>, or <A href="sub_match.html#oi">
streamed out to an output stream</A>.</P>
<H2>Unmatched Sub-Expressions</H2>
<P>When a regular expression match is found there is no need for all of the marked
sub-expressions to have participated in the match, for example the expression:</P>
<P>(abc)|(def)</P>
<P>can match either $1 or $2, but never both at the same time.&nbsp; In
Boost.Regex you can determine which sub-expressions matched by accessing the <A href="sub_match.html#m1">
sub_match::matched</A> data member.</P>
<H2>Repeated Captures</H2>
<P>When a marked sub-expression is repeated, then the sub-expression gets
"captured" multiple times, however normally only the final capture is
available, for example if</P>
<PRE>(?:(\w+)\W+)+</PRE>
<P>is matched against</P>
<PRE>one fine day</PRE>
<P>Then $1 will contain the string "day", and all the previous captures will have
been forgotten.</P>
<P>However, Boost.Regex has an experimental feature that allows all the capture
information to be retained - this is accessed either via the <A href="match_results.html#m17">
match_results::captures</A> member function or the <A href="sub_match.html#m8">sub_match::captures</A>
member function.&nbsp; These functions return a container that contains a
sequence of all the captures obtained during the regular expression
matching.&nbsp; The following example program shows how this information may be
used:</P>
<PRE>#include &lt;boost/regex.hpp&gt;
#include &lt;iostream&gt;
void print_captures(const std::string&amp; regx, const std::string&amp; text)
{
boost::regex e(regx);
boost::smatch what;
std::cout &lt;&lt; "Expression: \"" &lt;&lt; regx &lt;&lt; "\"\n";
std::cout &lt;&lt; "Text: \"" &lt;&lt; text &lt;&lt; "\"\n";
if(boost::regex_match(text, what, e, boost::match_extra))
{
unsigned i, j;
std::cout &lt;&lt; "** Match found **\n Sub-Expressions:\n";
for(i = 0; i &lt; what.size(); ++i)
std::cout &lt;&lt; " $" &lt;&lt; i &lt;&lt; " = \"" &lt;&lt; what[i] &lt;&lt; "\"\n";
std::cout &lt;&lt; " Captures:\n";
for(i = 0; i &lt; what.size(); ++i)
{
std::cout &lt;&lt; " $" &lt;&lt; i &lt;&lt; " = {";
for(j = 0; j &lt; what.captures(i).size(); ++j)
{
if(j)
std::cout &lt;&lt; ", ";
else
std::cout &lt;&lt; " ";
std::cout &lt;&lt; "\"" &lt;&lt; what.captures(i)[j] &lt;&lt; "\"";
}
std::cout &lt;&lt; " }\n";
}
}
else
{
std::cout &lt;&lt; "** No Match found **\n";
}
}
int main(int , char* [])
{
print_captures("(([[:lower:]]+)|([[:upper:]]+))+", "aBBcccDDDDDeeeeeeee");
print_captures("(.*)bar|(.*)bah", "abcbar");
print_captures("(.*)bar|(.*)bah", "abcbah");
print_captures("^(?:(\\w+)|(?&gt;\\W+))*$", "now is the time for all good men to come to the aid of the party");
return 0;
}</PRE>
<P>Which produces the following output:</P>
<PRE>Expression: "(([[:lower:]]+)|([[:upper:]]+))+"
Text: "aBBcccDDDDDeeeeeeee"
** Match found **
Sub-Expressions:
$0 = "aBBcccDDDDDeeeeeeee"
$1 = "eeeeeeee"
$2 = "eeeeeeee"
$3 = "DDDDD"
Captures:
$0 = { "aBBcccDDDDDeeeeeeee" }
$1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" }
$2 = { "a", "ccc", "eeeeeeee" }
$3 = { "BB", "DDDDD" }
Expression: "(.*)bar|(.*)bah"
Text: "abcbar"
** Match found **
Sub-Expressions:
$0 = "abcbar"
$1 = "abc"
$2 = ""
Captures:
$0 = { "abcbar" }
$1 = { "abc" }
$2 = { }
Expression: "(.*)bar|(.*)bah"
Text: "abcbah"
** Match found **
Sub-Expressions:
$0 = "abcbah"
$1 = ""
$2 = "abc"
Captures:
$0 = { "abcbah" }
$1 = { }
$2 = { "abc" }
Expression: "^(?:(\w+)|(?&gt;\W+))*$"
Text: "now is the time for all good men to come to the aid of the party"
** Match found **
Sub-Expressions:
$0 = "now is the time for all good men to come to the aid of the party"
$1 = "party"
Captures:
$0 = { "now is the time for all good men to come to the aid of the party" }
$1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to", "come", "to", "the", "aid", "of", "the", "party" }
</PRE>
<P>Unfortunately enabling this feature has an impact on performance (even if you
don't use it), and a much bigger impact if you do use it, therefore to use this
feature you need to:</P>
<UL>
<LI>
Define BOOST_REGEX_MATCH_EXTRA for all translation units including the library
source (the best way to do this is to uncomment this define in <A href="../../../boost/regex/user.hpp">
boost/regex/user.hpp</A>
and then rebuild everything.
<LI>
Pass the <A href="match_flag_type.html">match_extra flag</A> to the particular
algorithms where you actually need the captures information (<A href="regex_search.html">regex_search</A>,
<A href="regex_match.html">regex_match</A>, or <A href="regex_iterator.html">regex_iterator</A>).
</LI>
</UL>
<P>
<HR>
<P></P>
<P></P>
<p>Revised&nbsp;
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->
12&nbsp;Dec 2003
<!--webbot bot="Timestamp" endspan i-checksum="39359" --></p>
<p><i><EFBFBD> Copyright John Maddock&nbsp;
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%Y" startspan --> 2003<!--webbot bot="Timestamp" endspan i-checksum="39359" --></i></p>
<P><I>Use, modification and distribution are subject to the Boost Software License,
Version 1.0. (See accompanying file <A href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</A>
or copy at <A href="http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</A>)</I></P>
</body>
</html>