<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Boost.Regex: Index</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" type="text/css" href="../../../boost.css">
</head>
<body>
<P>
<TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
<TR>
<td valign="top" width="300">
<h3><a href="../../../index.htm"><img height="86" width="277" alt="C++ Boost" src="../../../c++boost.gif" border="0"></a></h3>
</td>
<TD width="353">
<H1 align="center">Boost.Regex</H1>
<H2 align="center">Understanding Captures</H2>
</TD>
<td width="50">
<h3><a href="index.html"><img height="45" width="43" alt="Boost.Regex Index" src="uarrow.gif" border="0"></a></h3>
</td>
</TR>
</TABLE>
</P>
<HR>
<p></p>
<H2>Marked sub-expressions</H2>
<P>Whenever a Perl regular expression contains a parenthesized group (), the
match produces an extra field, known as a marked sub-expression. For example,
the expression:</P>
<PRE>(\w+)\W+(\w+)</PRE>
<P>
has two marked sub-expressions (known as $1 and $2 respectively). In addition,
the complete match is known as $&, everything before the match as $`,
and everything after the match as $'. So if the above expression is
searched for within "@abc def--", then we obtain:</P>
<BLOCKQUOTE dir="ltr" style="MARGIN-RIGHT: 0px">
<P>
<TABLE id="Table2" cellSpacing="1" cellPadding="1" width="300" border="0">
<TR>
<TD>
<P dir="ltr" style="MARGIN-RIGHT: 0px">$`</P>
</TD>
<TD>"@"</TD>
</TR>
<TR>
<TD>$&</TD>
<TD>"abc def"</TD>
</TR>
<TR>
<TD>$1</TD>
<TD>"abc"</TD>
</TR>
<TR>
<TD>$2</TD>
<TD>"def"</TD>
</TR>
<TR>
<TD>$'</TD>
<TD>"--"</TD>
</TR>
</TABLE>
</P>
</BLOCKQUOTE>
<P>In Boost.Regex all of these are accessible via the <A href="match_results.html">match_results</A>
class, which is filled in by the matching algorithms (<A href="regex_search.html">regex_search</A>,
<A href="regex_match.html">regex_match</A>, or <A href="regex_iterator.html">regex_iterator</A>).
So given:</P>
<PRE>boost::match_results&lt;IteratorType&gt; m;</PRE>
<P>The Perl and Boost.Regex equivalents are as follows:</P>
<BLOCKQUOTE dir="ltr" style="MARGIN-RIGHT: 0px">
<P>
<TABLE id="Table3" cellSpacing="1" cellPadding="1" width="300" border="0">
<TR>
<TD><STRONG>Perl</STRONG></TD>
<TD><STRONG>Boost.Regex</STRONG></TD>
</TR>
<TR>
<TD>$`</TD>
<TD>m.prefix()</TD>
</TR>
<TR>
<TD>$&</TD>
<TD>m[0]</TD>
</TR>
<TR>
<TD>$n</TD>
<TD>m[n]</TD>
</TR>
<TR>
<TD>$'</TD>
<TD>m.suffix()</TD>
</TR>
</TABLE>
</P>
</BLOCKQUOTE>
<P>In Boost.Regex each sub-expression match is represented by a <A href="sub_match.html">sub_match</A>
object. This is essentially a pair of iterators denoting
the start and end positions of the sub-expression match, but additional
operators are provided so that objects of type sub_match behave much
like a std::basic_string: for example, they are implicitly <A href="sub_match.html#m3">convertible
to a basic_string</A>, and they can be <A href="sub_match.html#o21">compared
to a string</A>, <A href="sub_match.html#o81">added to a string</A>, or <A href="sub_match.html#oi">streamed
out to an output stream</A>.</P>
<H2>Unmatched Sub-Expressions</H2>
<P>When a regular expression match is found, not all of the marked
sub-expressions need to have participated in the match. For example, the expression:</P>
<PRE>(abc)|(def)</PRE>
<P>can match via either $1 or $2, but never both at the same time. In
Boost.Regex you can determine which sub-expressions matched by accessing the
<A href="sub_match.html#m1">sub_match::matched</A> data member.</P>
<H2>Repeated Captures</H2>
<P>When a marked sub-expression is repeated, the sub-expression is
"captured" multiple times; normally, however, only the final capture is
available. For example, if</P>
<PRE>(?:(\w+)\W*)+</PRE>
<P>is matched against</P>
<PRE>one fine day</PRE>
<P>then $1 will contain the string "day", and all the previous captures will have
been discarded.</P>
<P>However, Boost.Regex has an experimental feature that allows all the capture
information to be retained. It is accessed via either the <A href="match_results.html#m17">match_results::captures</A>
member function or the <A href="sub_match.html#m8">sub_match::captures</A>
member function. These functions return a container holding the
sequence of all the captures obtained during the regular expression
match. The following example program shows how this information may be
used:</P>
<PRE>#include &lt;boost/regex.hpp&gt;
#include &lt;iostream&gt;
#include &lt;string&gt;

// Prints the final value of each marked sub-expression, followed by the
// complete capture history that the match_extra flag makes available.
void print_captures(const std::string&amp; regx, const std::string&amp; text)
{
   boost::regex e(regx);
   boost::smatch what;
   std::cout &lt;&lt; "Expression: \"" &lt;&lt; regx &lt;&lt; "\"\n";
   std::cout &lt;&lt; "Text: \"" &lt;&lt; text &lt;&lt; "\"\n";
   if(boost::regex_match(text, what, e, boost::match_extra))
   {
      unsigned i, j;
      std::cout &lt;&lt; "** Match found **\n   Sub-Expressions:\n";
      // The final capture of each marked sub-expression.
      for(i = 0; i &lt; what.size(); ++i)
         std::cout &lt;&lt; "      $" &lt;&lt; i &lt;&lt; " = \"" &lt;&lt; what[i] &lt;&lt; "\"\n";
      std::cout &lt;&lt; "   Captures:\n";
      // Every capture recorded for each marked sub-expression.
      for(i = 0; i &lt; what.size(); ++i)
      {
         std::cout &lt;&lt; "      $" &lt;&lt; i &lt;&lt; " = {";
         for(j = 0; j &lt; what.captures(i).size(); ++j)
         {
            if(j)
               std::cout &lt;&lt; ", ";
            else
               std::cout &lt;&lt; " ";
            std::cout &lt;&lt; "\"" &lt;&lt; what.captures(i)[j] &lt;&lt; "\"";
         }
         std::cout &lt;&lt; " }\n";
      }
   }
   else
   {
      std::cout &lt;&lt; "** No Match found **\n";
   }
}

int main(int , char* [])
{
   print_captures("(([[:lower:]]+)|([[:upper:]]+))+", "aBBcccDDDDDeeeeeeee");
   print_captures("(.*)bar|(.*)bah", "abcbar");
   print_captures("(.*)bar|(.*)bah", "abcbah");
   print_captures("^(?:(\\w+)|(?&gt;\\W+))*$", "now is the time for all good men to come to the aid of the party");
   return 0;
}</PRE>
<P>This produces the following output:</P>
<PRE>Expression: "(([[:lower:]]+)|([[:upper:]]+))+"
Text: "aBBcccDDDDDeeeeeeee"
** Match found **
   Sub-Expressions:
      $0 = "aBBcccDDDDDeeeeeeee"
      $1 = "eeeeeeee"
      $2 = "eeeeeeee"
      $3 = "DDDDD"
   Captures:
      $0 = { "aBBcccDDDDDeeeeeeee" }
      $1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" }
      $2 = { "a", "ccc", "eeeeeeee" }
      $3 = { "BB", "DDDDD" }
Expression: "(.*)bar|(.*)bah"
Text: "abcbar"
** Match found **
   Sub-Expressions:
      $0 = "abcbar"
      $1 = "abc"
      $2 = ""
   Captures:
      $0 = { "abcbar" }
      $1 = { "abc" }
      $2 = { }
Expression: "(.*)bar|(.*)bah"
Text: "abcbah"
** Match found **
   Sub-Expressions:
      $0 = "abcbah"
      $1 = ""
      $2 = "abc"
   Captures:
      $0 = { "abcbah" }
      $1 = { }
      $2 = { "abc" }
Expression: "^(?:(\w+)|(?&gt;\W+))*$"
Text: "now is the time for all good men to come to the aid of the party"
** Match found **
   Sub-Expressions:
      $0 = "now is the time for all good men to come to the aid of the party"
      $1 = "party"
   Captures:
      $0 = { "now is the time for all good men to come to the aid of the party" }
      $1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to", "come", "to", "the", "aid", "of", "the", "party" }
</PRE>
<P>Unfortunately, enabling this feature has an impact on performance even if you
don't use it, and a much bigger impact if you do. Therefore, to use this
feature you need to:</P>
<UL>
<LI>
Define BOOST_REGEX_MATCH_EXTRA for all translation units, including the library
source (the best way to do this is to uncomment this define in <A href="../../../boost/regex/user.hpp">boost/regex/user.hpp</A>
and then rebuild everything).
</LI>
<LI>
Pass the <A href="match_flag_type.html">match_extra flag</A> to the particular
algorithms where you actually need the captures information (<A href="regex_search.html">regex_search</A>,
<A href="regex_match.html">regex_match</A>, or <A href="regex_iterator.html">regex_iterator</A>).
</LI>
</UL>
<P>
<HR>
<P></P>
<P></P>
<p>Revised
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->
12 Dec 2003
<!--webbot bot="Timestamp" endspan i-checksum="39359" --></p>
<p><i>&copy; Copyright John Maddock
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%Y" startspan --> 2003<!--webbot bot="Timestamp" endspan i-checksum="39359" --></i></p>
<P><I>Use, modification and distribution are subject to the Boost Software License,
Version 1.0. (See accompanying file <A href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</A>
or copy at <A href="http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</A>)</I></P>
</body>
</html>