From bf9350aa1659f52d51d6304df3aa21bf1969a033 Mon Sep 17 00:00:00 2001
From: John Maddock
+
+
+
+
+
+
+
+
+
+ Boost.Regex
+ Understanding Captures
+
+
+
+
Every time a Perl regular expression contains a parenthesis group (), it produces an extra field, known as a marked sub-expression. For example, the expression:
+(\w+)\W+(\w+)
+Has two marked sub-expressions (known as $1 and $2 respectively). In addition, the complete match is known as $&, everything before the match as $`, and everything after the match as $'. So if the above expression is searched for within "@abc def--", then we obtain:
+
+$`    "@"
+$&    "abc def"
+$1    "abc"
+$2    "def"
+$'    "--"
+
In Boost.Regex all of these are accessible via the match_results class, which gets filled in when calling one of the matching algorithms (regex_search, regex_match, or regex_iterator). So given:
+boost::match_results<IteratorType> m;
The Perl and Boost.Regex equivalents are as follows:
+
+Perl    Boost.Regex
+$`      m.prefix()
+$&      m[0]
+$n      m[n]
+$'      m.suffix()
+
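The mapping in the table above can be sketched in a few lines. This is a minimal illustration that assumes std::regex as a stand-in, since its match_results exposes the same prefix(), operator[] and suffix() members as Boost.Regex; the struct MatchParts and the function split_match are hypothetical names introduced only for this example.

```cpp
#include <regex>
#include <string>

// Holds the five pieces that Perl would call $`, $&, $1, $2 and $'.
// (Hypothetical type, used only by this sketch.)
struct MatchParts {
    std::string prefix, whole, first, second, suffix;
};

// Hypothetical helper: searches `text` for (\w+)\W+(\w+) and collects the
// pieces. Returns an all-empty MatchParts if nothing matches.
MatchParts split_match(const std::string& text)
{
    static const std::regex e(R"((\w+)\W+(\w+))");
    std::smatch m;
    MatchParts parts;
    if (std::regex_search(text, m, e)) {
        parts.prefix = m.prefix().str();  // $`
        parts.whole  = m[0].str();        // $&
        parts.first  = m[1].str();        // $1
        parts.second = m[2].str();        // $2
        parts.suffix = m.suffix().str();  // $'
    }
    return parts;
}
```

Calling split_match("@abc def--") fills the struct with the same five values listed in the first table.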
In Boost.Regex each sub-expression match is represented by a sub_match object. This is basically just a pair of iterators denoting the start and end position of the sub-expression match, but some additional operators are provided so that objects of type sub_match behave a lot like a std::basic_string: for example, they are implicitly convertible to a basic_string, and they can be compared to a string, added to a string, or streamed out to an output stream.
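As a rough illustration of that string-like behaviour, the sketch below uses std::regex as a stand-in (an assumption of this example, not the Boost.Regex headers). It shows the implicit conversion, comparison against a string, and streaming; it does not exercise addition to a string. The function bracket_first_word is a hypothetical name.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: returns the first word of `text` wrapped in brackets,
// exercising the string-like operations of sub_match along the way.
std::string bracket_first_word(const std::string& text)
{
    static const std::regex e(R"((\w+))");
    std::smatch m;
    if (!std::regex_search(text, m, e))
        return "";

    const std::ssub_match& word = m[1];  // a pair of iterators into `text`

    std::string copy = word;             // implicit conversion to std::string
    bool same = (word == copy);          // comparison against a string

    std::ostringstream out;
    out << '[' << word << ']';           // streaming to an output stream
    return same ? out.str() : std::string();
}
```

Because a sub_match only holds iterators, the searched string has to outlive any sub_match that refers to it; copying into a std::string, as above, removes that dependency.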
+When a regular expression match is found, not all of the marked sub-expressions need to have participated in the match. For example, the expression:
+(abc)|(def)
+can match either $1 or $2, but never both at the same time. In Boost.Regex you can determine which sub-expressions matched by accessing the sub_match::matched data member.
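A small sketch of that check, again assuming std::regex as a stand-in whose sub_match has the same matched member; which_alternative is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports which alternative of (abc)|(def) matched.
// Returns 1 or 2 for the sub-expression that participated, 0 for no match.
int which_alternative(const std::string& text)
{
    static const std::regex e("(abc)|(def)");
    std::smatch m;
    if (!std::regex_match(text, m, e))
        return 0;
    if (m[1].matched) return 1;  // $1 took part in the match
    if (m[2].matched) return 2;  // $2 took part in the match
    return 0;
}
```

For the input "def", m[1].matched is false and m[2].matched is true, so the helper returns 2.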
+When a marked sub-expression is repeated, the sub-expression gets "captured" multiple times; normally, however, only the final capture is available. For example, if
+(?:(\w+)\W+)+
is matched against
+one fine day
Then $1 will contain the string "day", and all the previous captures will have been forgotten.
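The "last capture wins" behaviour can be observed with a short sketch. It assumes std::regex as a stand-in for Boost.Regex, and the sample text in the usage note carries a trailing full stop so that the final \W+ has a character to consume; both are assumptions of this example. The function last_capture is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies (?:(\w+)\W+)+ to `text` and returns $1,
// i.e. whatever the last repetition of the group captured.
std::string last_capture(const std::string& text)
{
    static const std::regex e(R"((?:(\w+)\W+)+)");
    std::smatch m;
    if (!std::regex_search(text, m, e))
        return "";
    return m[1].str();  // earlier captures ("one", "fine") are not retained
}
```

With the text "one fine day." the group is repeated three times, and only the third capture, "day", is reported through $1.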
+However, Boost.Regex has an experimental feature that allows all the capture information to be retained. It is accessed either via the match_results::captures member function or the sub_match::captures member function. These functions return a container holding the sequence of all the captures obtained during the regular expression matching. The following example program shows how this information may be used:
+#include <boost/regex.hpp>
+#include <iostream>
+
+
+void print_captures(const std::string& regx, const std::string& text)
+{
+   boost::regex e(regx);
+   boost::smatch what;
+   std::cout << "Expression: \"" << regx << "\"\n";
+   std::cout << "Text: \"" << text << "\"\n";
+   if(boost::regex_match(text, what, e, boost::match_extra))
+   {
+      unsigned i, j;
+      std::cout << "** Match found **\n Sub-Expressions:\n";
+      for(i = 0; i < what.size(); ++i)
+         std::cout << " $" << i << " = \"" << what[i] << "\"\n";
+      std::cout << " Captures:\n";
+      for(i = 0; i < what.size(); ++i)
+      {
+         std::cout << " $" << i << " = {";
+         for(j = 0; j < what.captures(i).size(); ++j)
+         {
+            if(j)
+               std::cout << ", ";
+            else
+               std::cout << " ";
+            std::cout << "\"" << what.captures(i)[j] << "\"";
+         }
+         std::cout << " }\n";
+      }
+   }
+   else
+   {
+      std::cout << "** No Match found **\n";
+   }
+}
+
+int main(int , char* [])
+{
+   print_captures("(([[:lower:]]+)|([[:upper:]]+))+", "aBBcccDDDDDeeeeeeee");
+   print_captures("(.*)bar|(.*)bah", "abcbar");
+   print_captures("(.*)bar|(.*)bah", "abcbah");
+   print_captures("^(?:(\\w+)|(?>\\W+))*$", "now is the time for all good men to come to the aid of the party");
+   return 0;
+}
Which produces the following output:
+Expression: "(([[:lower:]]+)|([[:upper:]]+))+"
+Text: "aBBcccDDDDDeeeeeeee"
+** Match found **
+ Sub-Expressions:
+ $0 = "aBBcccDDDDDeeeeeeee"
+ $1 = "eeeeeeee"
+ $2 = "eeeeeeee"
+ $3 = "DDDDD"
+ Captures:
+ $0 = { "aBBcccDDDDDeeeeeeee" }
+ $1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" }
+ $2 = { "a", "ccc", "eeeeeeee" }
+ $3 = { "BB", "DDDDD" }
+Expression: "(.*)bar|(.*)bah"
+Text: "abcbar"
+** Match found **
+ Sub-Expressions:
+ $0 = "abcbar"
+ $1 = "abc"
+ $2 = ""
+ Captures:
+ $0 = { "abcbar" }
+ $1 = { "abc" }
+ $2 = { }
+Expression: "(.*)bar|(.*)bah"
+Text: "abcbah"
+** Match found **
+ Sub-Expressions:
+ $0 = "abcbah"
+ $1 = ""
+ $2 = "abc"
+ Captures:
+ $0 = { "abcbah" }
+ $1 = { }
+ $2 = { "abc" }
+Expression: "^(?:(\w+)|(?>\W+))*$"
+Text: "now is the time for all good men to come to the aid of the party"
+** Match found **
+ Sub-Expressions:
+ $0 = "now is the time for all good men to come to the aid of the party"
+ $1 = "party"
+ Captures:
+ $0 = { "now is the time for all good men to come to the aid of the party" }
+ $1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to", "come", "to", "the", "aid", "of", "the", "party" }
+
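Unfortunately, enabling this feature has an impact on performance even if you don't use it, and a much bigger impact if you do. To use this feature you therefore need to:
+Build and use the library with BOOST_REGEX_MATCH_EXTRA defined.
+Pass the match_extra flag to the matching algorithms (regex_search, regex_match, or regex_iterator) for which you need the capture information.
+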
Revised 12 Dec 2003
+© Copyright John Maddock 2003
+Use, modification and distribution are subject to the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+
+
+
diff --git a/doc/Attic/match_flag_type.html b/doc/Attic/match_flag_type.html
index 3e206ae5..dc9c2dbe 100644
--- a/doc/Attic/match_flag_type.html
+++ b/doc/Attic/match_flag_type.html
@@ -46,6 +46,7 @@ static const match_flag_type match_any;
 static const match_flag_type match_not_null;
 static const match_flag_type match_continuous;
 static const match_flag_type match_partial;
+static const match_flag_type match_single_line;
 static const match_flag_type match_prev_avail;
 static const match_flag_type match_not_dot_newline;
 static const match_flag_type match_not_dot_null;
@@ -167,6 +168,20 @@ static const match_flag_type format_all;
 in a full match.
+match_prev_avail
@@ -259,8 +274,7 @@ static const match_flag_type format_all;
 24 Oct 2003
 © Copyright John Maddock 1998-
-
- 2003
+ 2003
 Use, modification and distribution are subject to the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
diff --git a/doc/Attic/match_results.html b/doc/Attic/match_results.html
index ab210452..208067d6 100644
--- a/doc/Attic/match_results.html
+++ b/doc/Attic/match_results.html
@@ -2,23 +2,22 @@
- |
+ |
 Boost.Regex
 class match_results
- |
#include <boost/regex.hpp>
+#include <boost/regex.hpp>
 Regular expressions are different from many simple pattern-matching algorithms in that, as well as finding an overall match, they can also produce sub-expression matches: each sub-expression being delimited in the pattern by a pair of parentheses (...). There has to be some method for reporting sub-expression matches back to the user: this is achieved by defining a class match_results that acts as an indexed collection of sub-expression
- matches, each sub-expression match being contained in an object of type
- sub_match .
+ matches, each sub-expression match being contained in an object of type
+ sub_match.
 Template class match_results denotes a collection of character sequences representing the result of a regular expression match. Objects of type
- match_results are passed to the algorithms regex_match
- and regex_search, and are returned by the
- iterator regex_iterator . Storage for
+ match_results are passed to the algorithms regex_match
+ and regex_search, and are returned by the
+ iterator regex_iterator. Storage for
 the collection is allocated and freed as necessary by the member functions of class match_results.
 The template class match_results conforms to the requirements of a Sequence, as
@@ -51,8 +50,7 @@
 const-qualified Sequences are supported.
Class template match_results is most commonly used as one of the typedefs cmatch, wcmatch, smatch, or wsmatch:
-
-template <class BidirectionalIterator,
+template <class BidirectionalIterator,
           class Allocator = allocator<sub_match<BidirectionalIterator> >
 class match_results;
@@ -78,52 +76,58 @@ public:
    typedef basic_string<char_type> string_type;

    // construct/copy/destroy:
-   explicit match_results(const Allocator& a = Allocator());
-   match_results(const match_results& m);
-   match_results& operator=(const match_results& m);
+   explicit match_results(const Allocator& a = Allocator());
+   match_results(const match_results& m);
+   match_results& operator=(const match_results& m);
    ~match_results();

    // size:
-   size_type size() const;
-   size_type max_size() const;
-   bool empty() const;
+   size_type size() const;
+   size_type max_size() const;
+   bool empty() const;
    // element access:
-   difference_type length(int sub = 0) const;
-   difference_type position(unsigned int sub = 0) const;
-   string_type str(int sub = 0) const;
-   const_reference operator[](int n) const;
+   difference_type length(int sub = 0) const;
+   difference_type position(unsigned int sub = 0) const;
+   string_type str(int sub = 0) const;
+   const_reference operator[](int n) const;

-   const_reference prefix() const;
+   const_reference prefix() const;

-   const_reference suffix() const;
-   const_iterator begin() const;
-   const_iterator end() const;
+   const_reference suffix() const;
+   const_iterator begin() const;
+   const_iterator end() const;
    // format:
    template <class OutputIterator>
-   OutputIterator format(OutputIterator out,
+   OutputIterator format(OutputIterator out,
                          const string_type& fmt,
                          match_flag_type flags = format_default) const;
-   string_type format(const string_type& fmt,
+   string_type format(const string_type& fmt,
                       match_flag_type flags = format_default) const;

-   allocator_type get_allocator() const;
-   void swap(match_results& that);
+   allocator_type get_allocator() const;
+   void swap(match_results& that);
+
+#ifdef BOOST_REGEX_MATCH_EXTRA
+   typedef typename value_type::capture_sequence_type capture_sequence_type;
+   const capture_sequence_type& captures(std::size_t i)const;
+#endif
+
 };

 template <class BidirectionalIterator, class Allocator>
-bool operator == (const match_results<BidirectionalIterator, Allocator>& m1,
+bool operator == (const match_results<BidirectionalIterator, Allocator>& m1,
                   const match_results<BidirectionalIterator, Allocator>& m2);
 template <class BidirectionalIterator, class Allocator>
-bool operator != (const match_results<BidirectionalIterator, Allocator>& m1,
+bool operator != (const match_results<BidirectionalIterator, Allocator>& m1,
                   const match_results<BidirectionalIterator, Allocator>& m2);

 template <class charT, class traits, class BidirectionalIterator, class Allocator>
 basic_ostream<charT, traits>&
-   operator << (basic_ostream<charT, traits>& os,
+   operator << (basic_ostream<charT, traits>& os,
                 const match_results<BidirectionalIterator, Allocator>& m);

 template <class BidirectionalIterator, class Allocator>
-void swap(match_results<BidirectionalIterator, Allocator>& m1,
+void swap(match_results<BidirectionalIterator, Allocator>& m1,
           match_results<BidirectionalIterator, Allocator>& m2);
 Description
@@ -139,42 +143,41 @@ match_results(const Allocator& a = Allocator());
 of this function are indicated in the table:

 Element    Value
 empty()    true
 size()     0
 str()      basic_string<charT>()

@@ -190,82 +193,81 @@ match_results& operator=(const match_results& m);
 indicated in the table:

 Element        Value
 empty()        m.empty().
 size()         m.size().
 str(n)         m.str(n) for all integers n < m.size().
 prefix()       m.prefix().
 suffix()       m.suffix().
 (*this)[n]     m[n] for all integers n < m.size().
 length(n)      m.length(n) for all integers n < m.size().
 position(n)    m.position(n) for all integers n < m.size().

match_results size
@@ -342,11 +344,10 @@ const_iterator end()const;
 Effects: Returns a terminating iterator that enumerates over all the marked sub-expression matches stored in *this.
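To make the role of begin() and end() concrete, here is a minimal sketch that walks every stored sub-expression match. It assumes std::regex as a stand-in for Boost.Regex (its match_results offers the same iterator pair), and all_submatches is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every stored sub-match ($0, $1, ...) of the
// first match of `e` in `text`, or an empty vector if there is no match.
std::vector<std::string> all_submatches(const std::string& text,
                                        const std::regex& e)
{
    std::vector<std::string> out;
    std::smatch m;
    if (std::regex_search(text, m, e)) {
        // begin()/end() enumerate the sub_match objects in index order.
        for (std::smatch::const_iterator it = m.begin(); it != m.end(); ++it)
            out.push_back(it->str());
    }
    return out;
}
```

The number of elements visited equals size(), so the loop yields the whole match first and then one entry per marked sub-expression.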
match_results reformatting
-
-template <class OutputIterator>
+template <class OutputIterator>
 OutputIterator format(OutputIterator out,
                       const string_type& fmt,
-                      match_flag_type flags = format_default);
+                      match_flag_type flags = format_default);
 Requires: The type OutputIterator conforms to the Output Iterator
@@ -356,34 +357,34 @@ OutputIterator format(OutputIterator out,
 OutputIterator out. For each format specifier or escape sequence in fmt, replace that sequence with either the character(s) it represents, or the sequence of characters within *this to which it refers. The bitmasks specified
- in flags determines what
- format specifiers or escape sequences are recognized, by default this is
+ in flags determines what
+ format specifiers or escape sequences are recognized, by default this is
 the format used by ECMA-262, ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace.
 Returns: out.
 string_type format(const string_type& fmt,
-                    match_flag_type flags = format_default);
+                    match_flag_type flags = format_default);
 Effects: Returns a copy of the string fmt. For each format specifier or escape sequence in fmt, replace that sequence with either the character(s) it represents, or the sequence of characters within *this to
- which it refers. The bitmasks specified in flags
- determines what format specifiers or escape sequences
- are recognized, by default this is the format used by ECMA-262,
+ which it refers. The bitmasks specified in flags
+ determines what format specifiers or escape sequences
+ are recognized, by default this is the format used by ECMA-262,
 ECMAScript Language Specification, Chapter 15 part 5.4.11 String.prototype.replace.
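The string-returning overload of format can be sketched as follows. The example assumes std::regex as a stand-in for Boost.Regex and relies on the default ECMAScript-style format specifiers ($1, $2, $&); swap_words is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds two words with (\w+)\W+(\w+) and returns them
// in swapped order. Returns `text` unchanged if there is no match.
std::string swap_words(const std::string& text)
{
    static const std::regex e(R"((\w+)\W+(\w+))");
    std::smatch m;
    if (!std::regex_search(text, m, e))
        return text;
    // $1 and $2 refer to marked sub-expressions, $& to the whole match.
    return m.format("$2 $1 (was: $&)");
}
```

Note that format only expands the format string against the stored match; the text before and after the match (prefix() and suffix()) is not copied into the result.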
-
-allocator_type get_allocator()const;
+Allocator access
+allocator_type get_allocator()const;
 Effects: Returns a copy of the Allocator that was passed to the object's constructor.
-
-void swap(match_results& that);
-
+Swap
+void swap(match_results& that);
+
 Effects: Swaps the contents of the two sequences.
@@ -392,6 +393,36 @@ void swap(match_results& that);
 sequence of matched sub-expressions that were in *this.
 Complexity: constant time.
+Captures
+typedef typename value_type::capture_sequence_type capture_sequence_type;
+Defines an implementation-specific type that satisfies the requirements of a standard library Sequence (21.1.1 including the optional Table 68 operations), whose value_type is a sub_match<BidirectionalIterator>. This type happens to be std::vector<sub_match<BidirectionalIterator> >, but you shouldn't actually rely on that.
+const capture_sequence_type& captures(std::size_t i)const;
+Effects: returns a sequence containing all the captures obtained for sub-expression i.
+Returns: (*this)[i].captures();
Preconditions: the library must be built and used with BOOST_REGEX_MATCH_EXTRA defined, and you must pass the flag match_extra to the regex matching functions (regex_match, regex_search, regex_iterator or regex_token_iterator) in order for this member function to be defined and return useful information.
+Rationale: Enabling this feature has performance consequences: matching is slower even when the capture information is not used, and much slower when it is.
+
 template <class BidirectionalIterator, class Allocator>
 bool operator == (const match_results<BidirectionalIterator, Allocator>& m1,
@@ -418,10 +449,10 @@ void swap(match_results<BidirectionalIterator, Allocator>& m1,
 24 Oct 2003
 © Copyright John Maddock 1998-
-
- 2003
+ 2003
 Use, modification and distribution are subject to the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)