diff --git a/appendix.htm b/appendix.htm deleted file mode 100644 index ba0b3bdf..00000000 --- a/appendix.htm +++ /dev/null @@ -1,1304 +0,0 @@ - - - - - - -Regex++, Appendices - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, Appendices.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

Appendix 1: Implementation notes

- -

This is the first port of regex++ to the boost library, and is -based on regex++ 2.x, see changes.txt for a full list of changes -from the previous version. There are no known functionality bugs -except that POSIX style equivalence classes are only guaranteed -correct if the Win32 localization model is used (the default for -Win32 builds of the library).

- -

There are some aspects of the code that C++ puritans will -consider to be poor style, in particular the use of goto in some -of the algorithms. The code could be cleaned up, by changing to a -recursive implementation, although it is likely to be slower in -that case.

- -

The performance of the algorithms should be satisfactory in -most cases. For example the times taken to match the ftp response -expression "^([0-9]+)(\-| |$)(.*)$" against the string -"100- this is a line of ftp response which contains a -message string" are: BSD implementation 450 microseconds, -GNU implementation 271 microseconds, regex++ 127 microseconds (Pentium -P90, Win32 console app under MS Windows 95).
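As an illustration of the ftp response expression quoted above, here is a minimal sketch that extracts the reply code. It assumes std::regex (the standardized descendant of this library) rather than regex++ itself, so that it compiles without Boost; the helper name ftp_reply_code is hypothetical. It makes no attempt to reproduce the timings.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the numeric reply code of an FTP response
// line, or an empty string if the line does not match the expression.
// Assumption: std::regex is used here in place of regex++.
std::string ftp_reply_code(const std::string& line)
{
    // The expression from the text; R"(...)" avoids doubling backslashes.
    static const std::regex expr(R"(^([0-9]+)(\-| |$)(.*)$)");
    std::smatch what;
    if (!std::regex_match(line, what, expr))
        return std::string();
    // Whole match plus three marked sub-expressions.
    assert(what.size() == 4);
    return what[1].str();   // sub-expression 1 holds the digits
}
```

Calling ftp_reply_code with the sample string from the text yields the leading digits; a line without a leading number yields an empty string.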

- -

However it should be noted that there are some "pathological" -expressions which may require exponential time for matching; -these all involve nested repetition operators, for example -attempting to match the expression "(a*a)*b" against N -letter a's requires time proportional to 2^N. -These expressions can (almost) always be rewritten in such a way -as to avoid the problem, for example "(a*a)*b" could be -rewritten as "a*b" which requires only time linearly -proportional to N to solve. In the general case, non-nested -repeat expressions require time proportional to N^2, -however if the clauses are mutually exclusive then they can be -matched in linear time - this is the case with "a*b", -for each character the matcher will either match an "a" -or a "b" or fail, whereas with "a*a" the -matcher can't tell which branch to take (the first "a" -or the second) and so has to try both. Be careful how you -write your regular expressions and avoid nested repeats if you -can! New to this version, some previously pathological cases have -been fixed - in particular searching for expressions which -contain leading repeats and/or leading literal strings should be -much faster than before. Literal strings are now searched for -using the Knuth/Morris/Pratt algorithm (this is used in -preference to the Boyer/Moore algorithm because it allows the -tracking of newline characters).
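The rewrite suggested above is only safe if both expressions accept exactly the same strings. The following sketch checks that for short inputs. It assumes std::regex rather than regex++, and the function name same_answers is hypothetical; inputs are kept short because the nested form may take exponential time on a failed match.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns true if "(a*a)*b" and "a*b" give the same regex_match answer
// for n letter a's, both with and without a trailing 'b'.
// Assumption: std::regex stands in for regex++ here.
bool same_answers(std::size_t n)
{
    static const std::regex nested("(a*a)*b");  // nested repeat
    static const std::regex flat("a*b");        // equivalent rewrite
    const std::string as(n, 'a');
    const std::string with_b = as + "b";
    return std::regex_match(as, nested) == std::regex_match(as, flat)
        && std::regex_match(with_b, nested) == std::regex_match(with_b, flat);
}
```

A check like this is worth running over a handful of small n before replacing an expression in production code.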

- -

Some aspects of the POSIX regular expression syntax are -implementation defined:

- - - -
- -

Appendix 2: Thread safety

- -

Class reg_expression<> and its typedefs regex and wregex -are thread safe, in that compiled regular expressions can safely -be shared between threads. The matching algorithms regex_match, -regex_search, regex_grep, regex_format and regex_merge are all re-entrant -and thread safe. Class match_results is now thread safe, in that -the results of a match can be safely copied from one thread to -another (for example one thread may find matches and push -match_results instances onto a queue, while another thread pops -them off the other end), otherwise use a separate instance of -match_results per thread.

- -

The POSIX API functions are all re-entrant and thread safe, -regular expressions compiled with regcomp can also be -shared between threads.

- -

The class RegEx is only thread safe if each thread gets its -own RegEx instance (apartment threading) - this is a consequence -of RegEx handling both compiling and matching regular expressions. -

- -

Finally note that changing the global locale invalidates all -compiled regular expressions, therefore calling set_locale -from one thread while another uses regular expressions will -produce unpredictable results.

- -

There is also a requirement that there is only one thread -executing prior to the start of main().

- -
- -

Appendix 3: Localization

- -

 Regex++ provides extensive support for run-time -localization, the localization model used can be split into two -parts: front-end and back-end.

- -

Front-end localization deals with everything which the user -sees - error messages, and the regular expression syntax itself. -For example a French application could change [[:word:]] to [[:mot:]] -and \w to \m. Modifying the front end locale requires active -support from the developer, by providing the library with a -message catalogue to load, containing the localized strings. -Front-end locale is affected by the LC_MESSAGES category only.

- -

Back-end localization deals with everything that occurs after -the expression has been parsed - in other words everything that -the user does not see or interact with directly. It deals with -case conversion, collation, and character class membership. The -back-end locale does not require any intervention from the -developer - the library will acquire all the information it -requires for the current locale from the underlying operating -system / run time library. This means that if the program user -does not interact with regular expressions directly - for example -if the expressions are embedded in your C++ code - then no -explicit localization is required, as the library will take care -of everything for you. For example embedding the expression [[:word:]]+ -in your code will always match a whole word; if the program is -run on a machine with, for example, a Greek locale, then it will -still match a whole word, but in Greek characters rather than -Latin ones. The back-end locale is affected by the LC_CTYPE and -LC_COLLATE categories.

- -

There are three separate localization mechanisms supported by -regex++:

- -

Win32 localization model.

- -

This is the default model when the library is compiled under -Win32, and is encapsulated by the traits class w32_regex_traits. -When this model is in effect there is a single global locale as -defined by the user's control panel settings, and returned by -GetUserDefaultLCID. All the settings used by regex++ are acquired -directly from the operating system bypassing the C run time -library. Front-end localization requires a resource dll, -containing a string table with the user-defined strings. The -traits class exports the function:

- -

static std::string set_message_catalogue(const std::string& -s);

- -

which needs to be called with a string identifying the name of -the resource dll, before your code compiles any regular -expressions (but not necessarily before you construct any reg_expression -instances):

- -

boost::w32_regex_traits<char>::set_message_catalogue("mydll.dll"); -

- -

Note that this API sets the dll name for both the -narrow and wide character specializations of w32_regex_traits.

- -

This model does not currently support thread specific locales -(via SetThreadLocale under Windows NT). The library provides full -Unicode support under NT; under Windows 9x the library degrades -gracefully - characters 0 to 255 are supported, and the remainder are -treated as "unknown" graphic characters.

- -

C localization model.

- -

This is the default model when the library is compiled under -an operating system other than Win32, and is encapsulated by the -traits class c_regex_traits, -Win32 users can force this model to take effect by defining the -pre-processor symbol BOOST_REGEX_USE_C_LOCALE. When this model is -in effect there is a single global locale, as set by setlocale. -All settings are acquired from your run time library, -consequently Unicode support is dependent upon your run time -library implementation. Front end localization requires a POSIX -message catalogue. The traits class exports the function:

- -

static std::string set_message_catalogue(const std::string& -s);

- -

which needs to be called with a string identifying the name of -the message catalogue, before your code compiles any -regular expressions (but not necessarily before you construct any -reg_expression instances):

- -

boost::c_regex_traits<char>::set_message_catalogue("mycatalogue"); -

- -

Note that this API sets the message catalogue name for both the -narrow and wide character specializations of c_regex_traits. If -your run time library does not support POSIX message catalogues, -then you can either provide your own implementation of -<nl_types.h> or define BOOST_RE_NO_CAT to disable front-end -localization via message catalogues.

- -

Note that calling setlocale invalidates all compiled -regular expressions, calling setlocale(LC_ALL, "C") -will make this library behave equivalent to most traditional -regular expression libraries including version 1 of this library. -

- -

C++ localization model. -

- -

This model is only in effect if the library is built with the -pre-processor symbol BOOST_REGEX_USE_CPP_LOCALE defined. When -this model is in effect each instance of reg_expression<> -has its own instance of std::locale, class reg_expression<> -also has a member function imbue which allows the locale -for the expression to be set on a per-instance basis. Front end -localization requires a POSIX message catalogue, which will be -loaded via the std::messages facet of the expression's locale, -the traits class exports the symbol:

- -

static std::string set_message_catalogue(const std::string& -s);

- -

which needs to be called with a string identifying the name of -the message catalogue, before your code compiles any -regular expressions (but not necessarily before you construct any -reg_expression instances):

- -

boost::cpp_regex_traits<char>::set_message_catalogue("mycatalogue"); -

- -

Note that calling reg_expression<>::imbue will -invalidate any expression currently compiled in that instance of -reg_expression<>. This model is the one which closest fits -the ethos of the C++ standard library, however it is the model -which will produce the slowest code, and which is the least well -supported by current standard library implementations, for -example I have yet to find an implementation of std::locale which -supports either message catalogues, or locales other than "C" -or "POSIX".
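Because imbue invalidates whatever the instance has already compiled, the locale has to be set before the expression is assigned. The sketch below shows that ordering. It assumes std::basic_regex, whose imbue member behaves the same way in this respect, rather than reg_expression<>; the helper name make_regex is hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Builds an expression bound to a specific locale. The locale is imbued
// first, because imbue discards whatever was compiled before the call.
// Assumption: std::regex stands in for reg_expression<> here.
std::regex make_regex(const std::string& pattern, const std::locale& loc)
{
    std::regex re;
    re.imbue(loc);        // set the per-instance locale
    re.assign(pattern);   // compile under that locale
    return re;
}
```

Reversing the two calls would leave an expression that matches nothing until it is assigned again.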

- -

Finally note that if you build the library with a non-default -localization model, then the appropriate pre-processor symbol (BOOST_REGEX_USE_C_LOCALE -or BOOST_REGEX_USE_CPP_LOCALE) must be defined both when you -build the support library, and when you include <boost/regex.hpp> -or <boost/cregex.hpp> in your code. The best way to ensure -this is to add the #define to <boost/regex/detail/regex_options.hpp>. -

- -

Providing a message catalogue:

- -

In order to localize the front end of the library, you need to -provide the library with the appropriate message strings -contained either in a resource dll's string table (Win32 model), -or a POSIX message catalogue (C or C++ models). In the latter -case the messages must appear in message set zero of the -catalogue. The messages and their id's are as follows:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Message id Meaning Default value  
 101 The character used to start - a sub-expression. "("  
 102 The character used to end a - sub-expression declaration. ")"  
 103 The character used to denote - an end of line assertion. "$"  
 104 The character used to denote - the start of line assertion. "^"  
 105 The character used to denote - the "match any character expression". "."  
 106 The match zero or more times - repetition operator. "*"  
 107 The match one or more - repetition operator. "+"  
 108 The match zero or one - repetition operator. "?"  
 109 The character set opening - character. "["  
 110 The character set closing - character. "]"  
 111 The alternation operator. "|"  
 112 The escape character. "\\"  
 113 The hash character (not - currently used). "#"  
 114 The range operator. "-"  
 115 The repetition operator - opening character. "{"  
 116 The repetition operator - closing character. "}"  
 117 The digit characters. "0123456789"  
 118 The character which when - preceded by an escape character represents the word - boundary assertion. "b"  
 119 The character which when - preceded by an escape character represents the non-word - boundary assertion. "B"  
 120 The character which when - preceded by an escape character represents the word-start - boundary assertion. "<"  
 121 The character which when - preceded by an escape character represents the word-end - boundary assertion. ">"  
 122 The character which when - preceded by an escape character represents any word - character. "w"  
 123 The character which when - preceded by an escape character represents a non-word - character. "W"  
 124 The character which when - preceded by an escape character represents a start of - buffer assertion. "`A"  
 125 The character which when - preceded by an escape character represents an end of - buffer assertion. "'z"  
 126 The newline character. "\n"  
 127 The comma separator. ","  
 128 The character which when - preceded by an escape character represents the bell - character. "a"  
 129 The character which when - preceded by an escape character represents the form feed - character. "f"  
 130 The character which when - preceded by an escape character represents the newline - character. "n"  
 131 The character which when - preceded by an escape character represents the carriage - return character. "r"  
 132 The character which when - preceded by an escape character represents the tab - character. "t"  
 133 The character which when - preceded by an escape character represents the vertical - tab character. "v"  
 134 The character which when - preceded by an escape character represents the start of a - hexadecimal character constant. "x"  
 135 The character which when - preceded by an escape character represents the start of - an ASCII escape character. "c"  
 136 The colon character. ":"  
 137 The equals character. "="  
 138 The character which when - preceded by an escape character represents the ASCII - escape character. "e"  
 139 The character which when - preceded by an escape character represents any lower case - character. "l"  
 140 The character which when - preceded by an escape character represents any non-lower - case character. "L"  
 141 The character which when - preceded by an escape character represents any upper case - character. "u"  
 142 The character which when - preceded by an escape character represents any non-upper - case character. "U"  
 143 The character which when - preceded by an escape character represents any space - character. "s"  
 144 The character which when - preceded by an escape character represents any non-space - character. "S"  
 145 The character which when - preceded by an escape character represents any digit - character. "d"  
 146 The character which when - preceded by an escape character represents any non-digit - character. "D"  
 147 The character which when - preceded by an escape character represents the end quote - operator. "E"  
 148 The character which when - preceded by an escape character represents the start - quote operator. "Q"  
 149 The character which when - preceded by an escape character represents a Unicode - combining character sequence. "X"  
 150 The character which when - preceded by an escape character represents any single - character. "C"  
 151 The character which when - preceded by an escape character represents end of buffer - operator. "Z"  
 152 The character which when - preceded by an escape character represents the - continuation assertion. "G"  
 153 The character which when - preceded by (? indicates a zero width negated - forward lookahead assert. "!"  
- -


- -

Custom error messages are loaded as follows:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Message ID Error message ID Default string  
 201 REG_NOMATCH "No match"  
 202 REG_BADPAT "Invalid regular - expression"  
 203 REG_ECOLLATE "Invalid collation - character"  
 204 REG_ECTYPE "Invalid character - class name"  
 205 REG_EESCAPE "Trailing backslash" -  
 206 REG_ESUBREG "Invalid back reference" -  
 207 REG_EBRACK "Unmatched [ or [^" -  
 208 REG_EPAREN "Unmatched ( or \\(" -  
 209 REG_EBRACE "Unmatched \\{"  
 210 REG_BADBR "Invalid content of - \\{\\}"  
 211 REG_ERANGE "Invalid range end" -  
 212 REG_ESPACE "Memory exhausted" -  
 213 REG_BADRPT "Invalid preceding - regular expression"  
 214 REG_EEND "Premature end of - regular expression"  
 215 REG_ESIZE "Regular expression too - big"  
 216 REG_ERPAREN "Unmatched ) or \\)" -  
 217 REG_EMPTY "Empty expression" -  
 218 REG_E_UNKNOWN "Unknown error"  
- -


- -

Custom character class names are loaded as follows:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Message ID Description Equivalent default class - name  
 300 The character class name for - alphanumeric characters. "alnum"  
 301 The character class name for - alphabetic characters. "alpha"  
 302 The character class name for - control characters. "cntrl"  
 303 The character class name for - digit characters. "digit"  
 304 The character class name for - graphics characters. "graph"  
 305 The character class name for - lower case characters. "lower"  
 306 The character class name for - printable characters. "print"  
 307 The character class name for - punctuation characters. "punct"  
 308 The character class name for - space characters. "space"  
 309 The character class name for - upper case characters. "upper"  
 310 The character class name for - hexadecimal characters. "xdigit"  
 311 The character class name for - blank characters. "blank"  
 312 The character class name for - word characters. "word"  
 313 The character class name for - Unicode characters. "unicode"  
- -


- -

Finally, custom collating element names are loaded starting -from message id 400, and terminating when the first load -thereafter fails. Each message looks something like: "tagname -string" where tagname is the name used inside [[.tagname.]] -and string is the actual text of the collating element. -Note that the value of collating element [[.zero.]] is used for -the conversion of strings to numbers - if you replace this with -another value then that will be used for string parsing - for -example use the Unicode character 0x0660 for [[.zero.]] if you -want to use Unicode Arabic-Indic digits in your regular -expressions in place of Latin digits.

- -

Note that the POSIX defined names for character classes and -collating elements are always available - even if custom names -are defined, in contrast, custom error messages, and custom -syntax messages replace the default ones.

- -
- -

Appendix 4: Example Applications

- -

There are three demo applications that ship with this library, -they all come with makefiles for Borland, Microsoft and gcc -compilers, otherwise you will have to create your own makefiles.

- -
regress.exe:
- -

A regression test application that gives the matching/searching -algorithms a full workout. The presence of this program is your -guarantee that the library will behave as claimed - at least as -far as those items tested are concerned - if anyone spots -anything that isn't being tested I'd be glad to hear about it.

- -

Files: parse.cpp, regress.cpp, tests.cpp.

- -
jgrep.exe
- -

A simple grep implementation, run with no command line options -to find out its usage. Look at fileiter.cpp/fileiter.hpp -and the mapfile class to see an example of a "smart" -bidirectional iterator that can be used with regex++ or any other -STL algorithm.

- -

Files: jgrep.cpp, main.cpp.

- -
timer.exe
- -

A simple interactive expression matching application, the -results of all matches are timed, allowing the programmer to -optimize their regular expressions where performance is critical. -

- -

Files: regex_timer.cpp. -

- -

The snippets examples contain the code examples used in the -documentation:

- -

regex_match_example.cpp: -ftp based regex_match example.

- -

regex_search_example.cpp: -regex_search example: searches a cpp file for class definitions.

- -

regex_grep_example_1.cpp: -regex_grep example 1: searches a cpp file for class definitions.

- -

regex_merge_example.cpp: -regex_merge example: converts a C++ file to syntax highlighted -HTML.

- -

regex_grep_example_2.cpp: -regex_grep example 2: searches a cpp file for class definitions, -using a global callback function.

- -

regex_grep_example_3.cpp: -regex_grep example 3: searches a cpp file for class definitions, -using a bound member function callback.

- -

regex_grep_example_4.cpp: -regex_grep example 4: searches a cpp file for class definitions, -using a C++ Builder closure as a callback.

- -

regex_split_example_1.cpp: -regex_split example: split a string into tokens.

- -

regex_split_example_2.cpp: -regex_split example: spit out linked URL's.

- -
- -

Appendix 5: Header Files

- -

There are two main headers used by this library: <boost/regex.hpp> -provides full access to the entire library, while <boost/cregex.hpp> -provides access to just the high level class RegEx, and the POSIX -API functions.

- -
- -

Appendix 6: Redistributables

- -

 If you are using Microsoft or Borland C++ and link to a -dll version of the run time library, then you will also link to -one of the dll versions of regex++. While these dll's are -redistributable, there are no "standard" versions, so -when installing on the users PC, you should place these in a -directory private to your application, and not in the PC's -directory path. Note that if you link to a static version of your -run time library, then you will also link to a static version of -regex++ and no dll's will need to be distributed. The possible -regex++ dll and library names are computed according to the -following formula:
-

- -

"boost_regex_"
-+ BOOST_LIB_TOOLSET
-+ "_"
-+ BOOST_LIB_THREAD_OPT
-+ BOOST_LIB_RT_OPT
-+ BOOST_LIB_LINK_OPT
-+ BOOST_LIB_DEBUG_OPT
-
-These are defined as:
-
-BOOST_LIB_TOOLSET: The compiler toolset name (vc6, vc7, bcb5 etc).
-
-BOOST_LIB_THREAD_OPT: "s" for single thread builds,
-"m" for multithread builds.
-
-BOOST_LIB_RT_OPT: "s" for static runtime,
-"d" for dynamic runtime.
-
-BOOST_LIB_LINK_OPT: "s" for static link,
-"i" for dynamic link.
-
-BOOST_LIB_DEBUG_OPT: nothing for release builds,
-"d" for debug builds,
-"dd" for debug-diagnostic builds (_STLP_DEBUG).
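The naming formula above can be sketched as a small function. The struct build_config and the function regex_lib_name are hypothetical names introduced only for illustration; the library itself selects the name through pre-processor symbols, not through code like this. Only the name stem is produced, since the text does not specify a file extension.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of a build; field names are illustrative only.
struct build_config
{
    std::string toolset;   // BOOST_LIB_TOOLSET, e.g. "vc6", "vc7", "bcb5"
    bool multithread;      // BOOST_LIB_THREAD_OPT: "m" if true, else "s"
    bool dynamic_runtime;  // BOOST_LIB_RT_OPT: "d" if true, else "s"
    bool dynamic_link;     // BOOST_LIB_LINK_OPT: "i" if true, else "s"
    std::string debug;     // BOOST_LIB_DEBUG_OPT: "", "d" or "dd"
};

// Concatenates the parts in the order given by the formula in the text.
std::string regex_lib_name(const build_config& c)
{
    assert(c.debug.empty() || c.debug == "d" || c.debug == "dd");
    return "boost_regex_" + c.toolset + "_"
         + (c.multithread ? "m" : "s")
         + (c.dynamic_runtime ? "d" : "s")
         + (c.dynamic_link ? "i" : "s")
         + c.debug;
}
```

For example, a multithreaded vc6 debug build against the dll run time library and the dll version of regex++ gives the stem boost_regex_vc6_mdid.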

- -

Note: you can disable automatic library selection by defining -the symbol BOOST_REGEX_NO_LIB when compiling, this is useful if -you want to statically link even though you're using the dll -version of your run time library, or if you need to debug regex++. -

- -
- -

Notes for upgraders

- -

This version of regex++ is the first to be ported to the boost project, and as a result -has a number of changes to comply with the boost coding -guidelines.

- -

Headers have been changed from <header> or <header.h> -to <boost/header.hpp>

- -

The library namespace has changed from "jm", to -"boost".

- -

The reg_xxx algorithms have been renamed regex_xxx (to improve -naming consistency).

- -

Algorithm query_match has been renamed regex_match, and only -returns true if the expression matches the whole of the input -string (think input data validation).
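The whole-input behaviour of regex_match is easiest to see next to regex_search. This is a minimal sketch assuming std::regex, whose two algorithms have the same whole-string versus anywhere-in-string split; the function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Validation: true only if the whole input consists of digits.
// Assumption: std::regex stands in for regex++ here.
bool is_all_digits(const std::string& input)
{
    static const std::regex digits("[0-9]+");
    return std::regex_match(input, digits);
}

// Search: true if a run of digits occurs anywhere in the input.
bool contains_digits(const std::string& input)
{
    static const std::regex digits("[0-9]+");
    return std::regex_search(input, digits);
}
```

Code that relied on the old query_match to find a match at the start of a longer string therefore needs reviewing when it is ported.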

- -

Compiling existing code:

- -

The directory, libs/regex/old_include contains a set of -headers that make this version of regex++ compatible with -previous ones, either add this directory to your include path, or -copy these headers to the root directory of your boost -installation. The contents of these headers are deprecated and -undocumented - really these are just here for existing code - for -new projects use the new header forms.

- -
- -

Further Information (Contacts and -Acknowledgements)

- -

The author can be contacted at John_Maddock@compuserve.com, -the home page for this library is at http://ourworld.compuserve.com/homepages/John_Maddock/regexpp.htm, -and the official boost version can be obtained from www.boost.org/libraries.htm.

- -

I am indebted to Robert Sedgewick's "Algorithms in C++" -for forcing me to think about algorithms and their performance, -and to the folks at boost for forcing me to think, period. -The following people have all contributed useful comments or -fixes: Dave Abrahams, Mike Allison, Edan Ayal, Jayashree -Balasubramanian, Jan Bölsche, Beman Dawes, Paul Baxter, David -Bergman, David Dennerline, Edward Diener, Peter Dimov, Robert -Dunn, Fabio Forno, Tobias Gabrielsson, Rob Gillen, Marc Gregoire, -Chris Hecker, Nick Hodapp, Jesse Jones, Martin Jost, Boris -Krasnovskiy, Jan Hermelink, Max Leung, Wei-hao Lin, Jens Maurer, -Richard Peters, Heiko Schmidt, Jason Shirk, Gerald Slacik, Scobie -Smith, Mike Smyth, Alexander Sokolovsky, Hervé Poirier, Michael -Raykh, Marc Recht, Scott VanCamp, Bruno Voigt, Alexey Voinov, -Jerry Waldorf, Rob Ward, Lealon Watts, Thomas Witt and Yuval -Yosef. I am also grateful to the manuals supplied with the Henry -Spencer, Perl and GNU regular expression libraries - wherever -possible I have tried to maintain compatibility with these -libraries and with the POSIX standard - the code however is -entirely my own, including any bugs! I can absolutely guarantee -that I will not fix any bugs I don't know about, so if you have -any comments or spot any bugs, please get in touch.

- -

Useful further information can be found at:

- -

A short tutorial on regular expressions can -be found here.

- -

The Open -Unix Specification contains a wealth of useful material, -including the regular expression syntax, and specifications for <regex.h> -and <nl_types.h>. -

- -

The Pattern -Matching Pointers site is a "must visit" resource -for anyone interested in pattern matching.

- -

Glimpse and Agrep, -use a simplified regular expression syntax to achieve faster -search times.

- -

Udi Manber -and Ricardo Baeza-Yates -both have a selection of useful pattern matching papers available -from their respective web sites.

- -
- -

Copyright Dr -John Maddock 1998-2000 all rights reserved.

- - diff --git a/doc/Attic/standards.html b/doc/Attic/standards.html new file mode 100644 index 00000000..35a2e67e --- /dev/null +++ b/doc/Attic/standards.html @@ -0,0 +1,79 @@ + + + + Boost.Regex: Standards Conformance + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

Standards Conformance

+
+

Boost.Regex Index

+
+

+
+

C++

+

Boost.Regex is intended to conform to the + regular expression standardization proposal, which will appear in a + future C++ standard technical report (and hopefully in a future version of the + standard).  Currently there are some differences in how the regular + expression traits classes are defined; these will be fixed in a future release.

+

ECMAScript / JavaScript

+

All of the ECMAScript regular expression syntax features are supported, except + that:

+

Negated class escapes (\S, \D and \W) are not permitted inside character class + definitions ( [...] ).

+

The escape sequence \u matches any upper case character (the same as + [[:upper:]]) rather than a Unicode escape sequence; use \x{DDDD} for + Unicode escape sequences.

+

Perl

+

Almost all Perl features are supported, except for:

+

\N{name}  Use [[:name:]] instead.

+

\pP and \PP

+

(?imsx-imsx)

+

(?<=pattern)

+

(?<!pattern)

+

(?{code})

+

(??{code})

+

(?(condition)yes-pattern) and (?(condition)yes-pattern|no-pattern)

+

These embarrassments / limitations will be removed in due course, mainly + dependent upon user demand.

+

POSIX

+

All the POSIX basic and extended regular expression features are supported, + except that:

+

No character collating names are recognized except those specified in the POSIX + standard for the C locale, unless they are explicitly registered with the + traits class.

+

Character equivalence classes ( [[=a=]] etc) are probably buggy except on + Win32.  Implementing this feature requires knowledge of the format of the + string sort keys produced by the system; if you need this, and the default + implementation doesn't work on your platform, then you will need to supply a + custom traits class.

+
+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998- 2003

+

Permission to use, copy, modify, distribute and sell this software + and its documentation for any purpose is hereby granted without fee, provided + that the above copyright notice appear in all copies and that both that + copyright notice and this permission notice appear in supporting documentation. + Dr John Maddock makes no representations about the suitability of this software + for any purpose. It is provided "as is" without express or implied warranty.

+ + + + diff --git a/doc/Attic/sub_match.html b/doc/Attic/sub_match.html new file mode 100644 index 00000000..db995312 --- /dev/null +++ b/doc/Attic/sub_match.html @@ -0,0 +1,426 @@ + + + + Boost.Regex: sub_match + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

sub_match

+
+

Boost.Regex Index

+
+

+
+

Synopsis

+

#include <boost/regex.hpp> +

+

Regular expressions are different from many simple pattern-matching algorithms + in that as well as finding an overall match they can also produce + sub-expression matches: each sub-expression being delimited in the pattern by a + pair of parentheses (...). There has to be some method for reporting + sub-expression matches back to the user: this is achieved by defining a + class match_results that acts as an + indexed collection of sub-expression matches, each sub-expression match being + contained in an object of type sub_match + . + 

Objects of type sub_match may only be obtained by subscripting an object + of type match_results + . + 

When the marked sub-expression denoted by an object of type sub_match<> + participated in a regular expression match then member matched evaluates + to true, and members first and second denote the + range of characters [first,second) which formed that match. + Otherwise matched is false, and members first and second + contain undefined values.

+

If an object of type sub_match<> represents sub-expression 0 + - that is to say the whole match - then member matched is always + true, unless a partial match was obtained as a result of the flag match_partial + being passed to a regular expression algorithm, in which case member matched + is false, and members first and second represent the + character range that formed the partial match.
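The meaning of matched, first and second can be seen with an alternation, where only one marked sub-expression takes part in any given match. The sketch assumes std::sub_match, which exposes the same members as the class described here; the function name participating_group is hypothetical. Partial matches (match_partial) are not shown, since that flag is specific to this library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Matches "(a+)|(b+)" against the input and reports which alternative
// took part in the match: 1, 2, or 0 if there was no match at all.
// Assumption: std::regex stands in for Boost.Regex here.
int participating_group(const std::string& input)
{
    static const std::regex expr("(a+)|(b+)");
    std::smatch what;
    if (!std::regex_match(input, what, expr))
        return 0;
    for (int i = 1; i <= 2; ++i)
    {
        const std::ssub_match& sub = what[i];
        if (sub.matched)
        {
            // [first, second) is the range of characters that matched.
            assert(sub.length() == sub.second - sub.first);
            assert(sub.str() == std::string(sub.first, sub.second));
            return i;
        }
    }
    return 0;
}
```

The sub-expression that did not participate has matched set to false, so its first and second members are never read.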

+
+namespace boost{
+      
+template <class BidirectionalIterator>
+class sub_match : public std::pair<BidirectionalIterator, BidirectionalIterator>
+{
+public:
+   typedef typename iterator_traits<BidirectionalIterator>::value_type       value_type;
+   typedef typename iterator_traits<BidirectionalIterator>::difference_type  difference_type;
+   typedef          BidirectionalIterator                                    iterator;
+
+   bool matched;
+
+   difference_type length()const;
+   operator basic_string<value_type>()const;
+   basic_string<value_type> str()const;
+
+   int compare(const sub_match& s)const;
+   int compare(const basic_string<value_type>& s)const;
+   int compare(const value_type* s)const;
+};
+
+template <class BidirectionalIterator>
+bool operator == (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator != (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator < (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator <= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator >= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator > (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+
+
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator == (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator != (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator < (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator > (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator >= (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator <= (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs,
+                 const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs,
+                 const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+
+template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+
+template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+
+template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+
+template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+
+template <class charT, class traits, class BidirectionalIterator>
+basic_ostream<charT, traits>&
+   operator << (basic_ostream<charT, traits>& os,
+                const sub_match<BidirectionalIterator>& m);
+
+} // namespace boost
+

Description

+

+ sub_match members

+
typedef typename std::iterator_traits<iterator>::value_type value_type;
+

The type pointed to by the iterators.

+
typedef typename std::iterator_traits<iterator>::difference_type difference_type;
+

A type that represents the difference between two iterators.

+
typedef iterator iterator_type;
+

The iterator type.

+
iterator first
+

An iterator denoting the position of the start of the match.

+
iterator second
+

An iterator denoting the position of the end of the match.

+
bool matched
+

A Boolean value denoting whether this sub-expression participated in the match.

+
difference_type length()const;
+ +

+ Effects: returns (matched ? distance(first, second) : 0).

operator basic_string<value_type>()const;
+ +

+ Effects: returns (matched ? basic_string<value_type>(first, + second) : basic_string<value_type>()).

basic_string<value_type> str()const;
+ +

+ Effects: returns (matched ? basic_string<value_type>(first, + second) : basic_string<value_type>()).

int compare(const sub_match& s)const;
+ +

+ Effects: returns str().compare(s.str()).

int compare(const basic_string<value_type>& s)const;
+ +

+ Effects: returns str().compare(s).

int compare(const value_type* s)const;
+ +

+ Effects: returns str().compare(s).
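A short sketch of length(), str() and compare() in use follows. It assumes the standard library's <regex> (std::ssub_match offers the same three members as the class documented here) rather than Boost itself; first_group and compare_first_group are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Returns the text of the first marked sub-expression of `pattern` in `text`,
// or an empty string when nothing matched or the group took no part.
std::string first_group(const std::string& text, const std::string& pattern)
{
    const std::regex e(pattern);
    std::smatch what;
    if (!std::regex_search(text, what, e) || what.size() < 2)
        return std::string();
    const std::ssub_match& sub = what[1];
    // length() is distance(first, second) for a matched group and 0 otherwise,
    // so it always agrees with the size of str().
    assert(static_cast<std::size_t>(sub.length()) == sub.str().size());
    return sub.str();
}

// Three-way comparison of the first group against `other`,
// equivalent to str().compare(other).
int compare_first_group(const std::string& text, const std::string& pattern,
                        const std::string& other)
{
    const std::regex e(pattern);
    std::smatch what;
    if (!std::regex_search(text, what, e) || what.size() < 2)
        return std::string().compare(other);
    return what[1].compare(other);
}
```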

+

+ sub_match non-member operators

+
template <class BidirectionalIterator>
+bool operator == (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) == 0.

template <class BidirectionalIterator>
+bool operator != (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) != 0.

template <class BidirectionalIterator>
+bool operator < (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) < 0.

template <class BidirectionalIterator>
+bool operator <= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) <= 0.

template <class BidirectionalIterator>
+bool operator >= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) >= 0.

template <class BidirectionalIterator>
+bool operator > (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) > 0.
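Because these operators are defined in terms of compare(), two sub_match objects compare by the text they captured, not by their position in the input. The sketch below illustrates this under the assumption that the standard library's <regex> stands in for Boost.Regex; same_word_twice and words_in_order are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `text` consists of the same word written twice, e.g. "hello hello".
bool same_word_twice(const std::string& text)
{
    static const std::regex two_words("(\\w+) (\\w+)");
    std::smatch what;
    if (!std::regex_match(text, what, two_words))
        return false;
    assert(what.size() == 3);      // whole match plus two marked sub-expressions
    return what[1] == what[2];     // sub_match == sub_match
}

// True if `text` is two words and the first sorts strictly before the second.
bool words_in_order(const std::string& text)
{
    static const std::regex two_words("(\\w+) (\\w+)");
    std::smatch what;
    return std::regex_match(text, what, two_words) && what[1] < what[2];
}
```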

template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs == rhs.str().

template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs != rhs.str().

template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs < rhs.str().

template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs > rhs.str().

template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs >= rhs.str().

template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs <= rhs.str().

template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() == rhs.

template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() != rhs.

template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() < rhs.

template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() > rhs.

template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() >= rhs.

template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() <= rhs.

template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs == rhs.str().

template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs != rhs.str().

template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs < rhs.str().

template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs > rhs.str().

template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs >= rhs.str().

template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs <= rhs.str().

template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() == rhs.

template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() != rhs.

template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() < rhs.

template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() > rhs.

template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() >= rhs.

template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() <= rhs.

template <class charT, class traits, class BidirectionalIterator>
+basic_ostream<charT, traits>&
+   operator << (basic_ostream<charT, traits>& os,
+                const sub_match<BidirectionalIterator>& m);
+ +

+ Effects: returns (os << m.str()). +
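The mixed comparisons and the stream inserter allow a sub_match to be used much like a string without calling str() explicitly. The following is a sketch only, assuming the standard library's <regex> (which provides the same non-member operators for std::sub_match); group_is and describe are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>

// True if the first marked sub-expression of `pattern` in `text` equals `expected`.
bool group_is(const std::string& text, const std::string& pattern, const char* expected)
{
    assert(expected != nullptr);
    const std::regex e(pattern);
    std::smatch what;
    if (!std::regex_search(text, what, e) || what.size() < 2)
        return false;
    return what[1] == expected                  // sub_match == const value_type*
        && expected == what[1]                  // const value_type* == sub_match
        && what[1] == std::string(expected);    // sub_match == basic_string
}

// Formats a match as "whole|group1" by streaming the sub_match objects directly.
std::string describe(const std::string& text, const std::string& pattern)
{
    const std::regex e(pattern);
    std::smatch what;
    if (!std::regex_search(text, what, e) || what.size() < 2)
        return "no match";
    std::ostringstream out;
    out << what[0] << '|' << what[1];           // operator<< writes m.str()
    return out.str();
}
```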


+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998-2003

+

Permission to use, copy, modify, distribute and sell this software + and its documentation for any purpose is hereby granted without fee, provided + that the above copyright notice appear in all copies and that both that + copyright notice and this permission notice appear in supporting documentation. + Dr John Maddock makes no representations about the suitability of this software + for any purpose. It is provided "as is" without express or implied warranty.

+ + + + diff --git a/doc/Attic/syntax.html b/doc/Attic/syntax.html new file mode 100644 index 00000000..f776cd3c --- /dev/null +++ b/doc/Attic/syntax.html @@ -0,0 +1,773 @@ + + + + Boost.Regex: Regular Expression Syntax + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

Regular Expression Syntax

+
+

Boost.Regex Index

+
+

+
+

This section covers the regular expression syntax used by this library. It is + a programmer's guide: the actual syntax presented to your program's users will + depend upon the flags used during expression compilation. +

+

Literals +

+

All characters are literals except: ".", "|", "*", "?", "+", "(", ")", "{", + "}", "[", "]", "^", "$" and "\". These characters are literals when preceded by + a "\". A literal is a character that matches itself, or matches the result of + traits_type::translate(), where traits_type is the traits template parameter to + class basic_regex.
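One practical use of this list is to turn arbitrary text into an expression that matches it literally, by preceding each special character with "\". The sketch below assumes the standard library's <regex> with its default grammar, which treats the same characters as special; escape_literal and matches_literally are hypothetical names, and traits_type::translate() is not involved.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns `text` with every special character from the list above escaped,
// so that the result, used as an expression, matches `text` itself.
std::string escape_literal(const std::string& text)
{
    static const std::string special = ".|*?+(){}[]^$\\";
    std::string out;
    for (char c : text) {
        if (special.find(c) != std::string::npos)
            out += '\\';
        out += c;
    }
    return out;
}

// True if `text` is exactly the string `literal`, tested through a regular expression.
bool matches_literally(const std::string& text, const std::string& literal)
{
    assert(!literal.empty());
    return std::regex_match(text, std::regex(escape_literal(literal)));
}
```

Without the escaping step, "1+1=2?" used as an expression would match "11=2" rather than the text "1+1=2?".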

+

Wildcard +

+

The dot character "." matches any single character, with two exceptions: when match_not_dot_null + is passed to the matching algorithms, the dot does not match a null character; + when match_not_dot_newline is passed to the matching algorithms, + the dot does not match a newline character. +

+

Repeats +

+

A repeat is an expression that is repeated an arbitrary number of times. An + expression followed by "*" can be repeated any number of times including zero. + An expression followed by "+" can be repeated any number of times, but at least + once; if the expression is compiled with the flag regex_constants::bk_plus_qm + then "+" is an ordinary character and "\+" represents a repeat of once or more. + An expression followed by "?" may be repeated zero or one times only; if the + expression is compiled with the flag regex_constants::bk_plus_qm then "?" is an + ordinary character and "\?" represents the repeat zero or once operator. When + it is necessary to specify the minimum and maximum number of repeats + explicitly, the bounds operator "{}" may be used: "a{2}" is the letter "a" + repeated exactly twice, "a{2,4}" represents the letter "a" repeated between 2 + and 4 times, and "a{2,}" represents the letter "a" repeated at least twice with + no upper limit. Note that there must be no white-space inside the {}, and there + is no upper limit on the values of the lower and upper bounds. When the + expression is compiled with the flag regex_constants::bk_braces then "{" and + "}" are ordinary characters and "\{" and "\}" are used to delimit bounds + instead. All repeat expressions refer to the shortest possible previous + sub-expression: a single character, a character set, or a sub-expression + grouped with "()", for example. +

+

Examples: +

+

"ba*" will match all of "b", "ba", "baaa" etc. +

+

"ba+" will match "ba" or "baaaa" for example but not "b". +

+

"ba?" will match "b" or "ba". +

+

"ba{2,4}" will match "baa", "baaa" and "baaaa". +

+
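The repeat examples above can be checked with a whole-string match. This is a sketch assuming the standard library's <regex> in its default grammar, where "*", "+", "?" and "{}" behave as described; the bk_plus_qm and bk_braces flags are Boost-specific and are not used. The name matches is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when the whole of `text` matches `pattern`.
bool matches(const std::string& text, const std::string& pattern)
{
    assert(!pattern.empty());
    return std::regex_match(text, std::regex(pattern));
}
```

For instance matches("baaa", "ba*") holds, while matches("b", "ba+") and matches("ba", "ba{2,4}") do not.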

Non-greedy repeats +

+

Whenever the "extended" regular expression syntax is in use (the default) then + non-greedy repeats are possible by appending a '?' after the repeat; a + non-greedy repeat is one which will match the shortest possible string. +

+

For example to match HTML tag pairs one could use something like: +

+

"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>" +

+

In this case $1 will contain the text between the tag pairs, and will be the + shortest possible matching string.  +
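The difference between the greedy and non-greedy forms of this expression can be seen in the following sketch. It assumes the standard library's <regex>, which supports the same "?" suffix, and assumes that the tag name contains no special characters; inner_text is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text captured between an opening and closing `tag` in `html`,
// or an empty string if no tag pair is found. With greedy == false the
// repeat is "(.*?)" and the shortest possible text is captured.
std::string inner_text(const std::string& html, const std::string& tag,
                       bool greedy = false)
{
    assert(!tag.empty());
    const std::string repeat = greedy ? "(.*)" : "(.*?)";
    const std::regex e("<\\s*" + tag + "[^>]*>" + repeat + "<\\s*/" + tag + "\\s*>");
    std::smatch what;
    return std::regex_search(html, what, e) ? what[1].str() : std::string();
}
```

Given "<b>one</b> and <b>two</b>", the non-greedy form captures "one", whereas the greedy form runs on to the last closing tag and captures "one</b> and <b>two".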

+

Parentheses +

+

Parentheses serve two purposes: to group items together into a sub-expression, + and to mark what generated the match. For example the expression "(ab)*" would + match all of the string "ababab". The matching algorithms + regex_match and regex_search + each take an instance of match_results + that reports what caused the match; on exit from these functions the + match_results contains information both on what the whole expression + matched and on what each sub-expression matched. In the example above + match_results[1] would contain a pair of iterators denoting the final "ab" of + the matching string. It is permissible for sub-expressions to match null + strings. If a sub-expression takes no part in a match - for example if it is + part of an alternative that is not taken - then both of the iterators that are + returned for that sub-expression point to the end of the input string, and the matched + member for that sub-expression is false. Sub-expressions are indexed + from left to right starting from 1; sub-expression 0 is the whole expression. +
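The "(ab)*" example can be sketched as follows, assuming the standard library's <regex>, which records the last repetition of a marked sub-expression in the same way; last_repeat_offset is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Matches the whole of `text` against "(ab)*" and returns the offset at which
// sub-expression 1 begins, or -1 if the match fails or the sub-expression
// took no part in it (as happens for the empty string).
std::ptrdiff_t last_repeat_offset(const std::string& text)
{
    static const std::regex e("(ab)*");
    std::smatch what;
    if (!std::regex_match(text, what, e))
        return -1;
    assert(what.size() == 2);    // index 0: whole expression, index 1: "(ab)"
    return what[1].matched ? what[1].first - text.begin() : -1;
}
```

For "ababab" the result is 4, the position of the final "ab".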

+

Non-Marking Parentheses +

+

Sometimes you need to group sub-expressions with parentheses, but don't want + the parentheses to generate another marked sub-expression; in this case + non-marking parentheses (?:expression) can be used. For example the following + expression creates no sub-expressions: +

+

"(?:abc)*"

+
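The effect of non-marking parentheses on the number of sub-expressions can be observed directly. The sketch assumes the standard library's <regex>, where basic_regex::mark_count() reports the number of marked sub-expressions; marked_groups and result_slots are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Number of marked sub-expressions in `pattern`.
unsigned marked_groups(const std::string& pattern)
{
    return std::regex(pattern).mark_count();
}

// Number of entries in the match results after matching the whole of `text`
// (entry 0 is the whole expression), or 0 if `text` does not match.
std::size_t result_slots(const std::string& text, const std::string& pattern)
{
    const std::regex e(pattern);
    std::smatch what;
    if (!std::regex_match(text, what, e))
        return 0;
    assert(what.size() == e.mark_count() + 1);
    return what.size();
}
```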

Forward Lookahead Asserts  +

+

There are two forms of these; one for positive forward lookahead asserts, and + one for negative lookahead asserts:

+

"(?=abc)" matches zero characters only if they are followed by the expression + "abc".

+

"(?!abc)" matches zero characters only if they are not followed by the + expression "abc".

+
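Since a lookahead assert consumes no characters, its effect is easiest to see from where a search succeeds. This is a sketch assuming the standard library's <regex>, which accepts the same (?=...) and (?!...) syntax; find_at is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Offset of the first match of `pattern` in `text`, or -1 if there is none.
std::ptrdiff_t find_at(const std::string& text, const std::string& pattern)
{
    assert(!pattern.empty());
    const std::regex e(pattern);
    std::smatch what;
    return std::regex_search(text, what, e) ? what.position(0) : -1;
}
```

In "foobar foobaz", "foo(?=bar)" is found at offset 0 and "foo(?!bar)" at offset 7; in both cases only "foo" forms part of the match.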

Independent sub-expressions

+

"(?>expression)" matches "expression" as an independent atom (the algorithm + will not backtrack into it if a failure occurs later in the expression).

+

Alternatives +

+

Alternatives occur when the expression can match either one sub-expression or + another. Alternatives are separated by a "|", or a "\|" if the flag + regex_constants::bk_vbar is set, or by a newline character if the flag + regex_constants::newline_alt is set. Each alternative is the largest possible + previous sub-expression; this is the opposite behavior from repetition + operators. +

+

Examples: +

+

"a(b|c)" could match "ab" or "ac". +

+

"abc|def" could match "abc" or "def". +

+
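The reach of "|" in each of these expressions can be checked by filtering a list of candidate strings. The sketch assumes the standard library's <regex> with "|" as the separator (the bk_vbar and newline_alt flags are Boost-specific and not used); accepted is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Returns those `candidates` that are matched in full by `pattern`, in order.
std::vector<std::string> accepted(const std::string& pattern,
                                  const std::vector<std::string>& candidates)
{
    assert(!pattern.empty());
    const std::regex e(pattern);
    std::vector<std::string> out;
    for (const std::string& c : candidates)
        if (std::regex_match(c, e))
            out.push_back(c);
    return out;
}
```

With "abc|def" only "abc" and "def" are accepted; "abdef" is not, because each alternative extends as far as possible rather than applying to the single characters "c" and "d".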

Sets +

+

A set is a set of characters that can match any single character that is a + member of the set. Sets are delimited by "[" and "]" and can contain literals, + character ranges, character classes, collating elements and equivalence + classes. Set declarations that start with "^" contain the complement of the + elements that follow. +

+

Examples: +

+

Character literals: +

+

"[abc]" will match either of "a", "b", or "c". +

+

"[^abc]" will match any character other than "a", "b", or "c". +

+

Character ranges: +

+

"[a-z]" will match any character in the range "a" to "z". +

+

"[^A-Z]" will match any character other than those in the range "A" to "Z". +

+
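One way to see which characters a set matches is to delete every match from a string. This sketch assumes the standard library's <regex> in the default "C" locale, so ranges follow ASCII order; strip is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns `text` with every character matched by the set expression `set` removed.
std::string strip(const std::string& text, const std::string& set)
{
    assert(!set.empty());
    return std::regex_replace(text, std::regex(set), std::string());
}
```

For example strip("Hello World", "[^A-Z]") leaves only the upper case letters, and strip("a1b2c3", "[a-z]") leaves only the digits.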

Note that character ranges are highly locale dependent if the flag + regex_constants::collate is set: they match any character that collates between + the endpoints of the range, and ranges will only behave according to ASCII rules + when the default "C" locale is in effect. For example if the library is + compiled with the Win32 localization model, then [a-z] will match the ASCII + characters a-z, and also 'A', 'B' etc, but not 'Z' which collates just after + 'z'. This locale-specific behavior is disabled by default (in Perl mode), in which case + ranges collate according to ASCII character code. +

+

Character classes are denoted using the syntax "[:classname:]" within a set + declaration, for example "[[:space:]]" is the set of all whitespace characters. + Character classes are only available if the flag regex_constants::char_classes + is set. The available character classes are: +
+   +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
alnum: Any alpha numeric character.
alpha: Any alphabetical character a-z and A-Z. Other + characters may also be included depending upon the locale.
blank: Any blank character, either a space or a tab.
cntrl: Any control character.
digit: Any digit 0-9.
graph: Any graphical character.
lower: Any lower case character a-z. Other characters may + also be included depending upon the locale.
print: Any printable character.
punct: Any punctuation character.
space: Any whitespace character.
upper: Any upper case character A-Z. Other characters may + also be included depending upon the locale.
xdigit: Any hexadecimal digit character, 0-9, a-f and A-F.
word: Any word character - all alphanumeric characters plus + the underscore.
Unicode: Any character whose code is greater than 255, this + applies to the wide character traits classes only.
+

+

There are some shortcuts that can be used in place of the character classes, + provided the flag regex_constants::escape_in_lists is set then you can use: +

+

\w in place of [:word:] +

+

\s in place of [:space:] +

+

\d in place of [:digit:] +

+

\l in place of [:lower:] +

+

\u in place of [:upper:]  +
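The equivalence between a character class and its shortcut can be checked by counting matches. This sketch assumes the standard library's <regex> in the default "C" locale; it covers only \w, \s and \d, because the \l and \u shortcuts and the char_classes and escape_in_lists flags are specific to this library. The name count_matches is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts the non-overlapping matches of `pattern` in `text`.
std::ptrdiff_t count_matches(const std::string& text, const std::string& pattern)
{
    assert(!pattern.empty());
    const std::regex e(pattern);   // must outlive the iterators below
    return std::distance(std::sregex_iterator(text.begin(), text.end(), e),
                         std::sregex_iterator());
}
```

For the text "a1 b22" followed by a tab and "c", both "[[:digit:]]" and "\d" give 3, and both "[[:space:]]" and "\s" give 2.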

+

Collating elements take the general form [.tagname.] inside a set declaration, + where tagname is either a single character, or a name of a collating + element, for example [[.a.]] is equivalent to [a], and [[.comma.]] is + equivalent to [,]. The library supports all the standard POSIX collating + element names, and in addition the following digraphs: "ae", "ch", "ll", "ss", + "nj", "dz", "lj", each in lower, upper and title case variations. + Multi-character collating elements can result in the set matching more than one + character, for example [[.ae.]] would match two characters, but note that + [^[.ae.]] would only match one character.  +

+

+ Equivalence classes take the general form [=tagname=] inside a set declaration, + where tagname is either a single character, or a name of a collating + element, and matches any character that is a member of the same primary + equivalence class as the collating element [.tagname.]. An equivalence class is + a set of characters that collate the same, a primary equivalence class is a set + of characters whose primary sort keys are all the same (for example strings are + typically collated by character, then by accent, and then by case; the primary + sort key then relates to the character, the secondary to the accent, and + the tertiary to the case). If there is no equivalence class corresponding to tagname, + then [=tagname=] is exactly the same as [.tagname.]. Unfortunately there is no + locale independent method of obtaining the primary sort key for a character, + except under Win32. For other operating systems the library will "guess" the + primary sort key from the full sort key (obtained from strxfrm), so + equivalence classes are probably best considered broken under any operating + system other than Win32.  +

+

To include a literal "-" in a set declaration: make it the first character + after the opening "[" or "[^", make it the endpoint of a range or a collating element, or, + if the flag regex_constants::escape_in_lists is set, precede it with an escape + character as in "[\-]". To include a literal "[", "]" or "^" in a set, + make it the endpoint of a range or a collating element, or precede it with an + escape character if the flag regex_constants::escape_in_lists is set. +

+

Line anchors +

+

An anchor is something that matches the null string at the start or end of a + line: "^" matches the null string at the start of a line, "$" matches the null + string at the end of a line. +
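A small sketch of the two anchors follows. It assumes the standard library's <regex> and single-line input, so that the start and end of the text are also the start and end of the line, and it assumes the word contains no special characters; starts_with_word, ends_with_word and is_blank are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `line` begins with `word`.
bool starts_with_word(const std::string& line, const std::string& word)
{
    assert(!word.empty());
    return std::regex_search(line, std::regex("^" + word));
}

// True if `line` ends with `word`.
bool ends_with_word(const std::string& line, const std::string& word)
{
    assert(!word.empty());
    return std::regex_search(line, std::regex(word + "$"));
}

// True if `line` is empty or contains only whitespace.
bool is_blank(const std::string& line)
{
    return std::regex_search(line, std::regex("^\\s*$"));
}
```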

+

Back references +

+

A back reference is a reference to a previous sub-expression that has already + been matched, the reference is to what the sub-expression matched, not to the + expression itself. A back reference consists of the escape character "\" + followed by a digit "1" to "9", "\1" refers to the first sub-expression, "\2" + to the second etc. For example the expression "(.*)\1" matches any string that + is repeated about its mid-point for example "abcabc" or "xyzxyz". A back + reference to a sub-expression that did not participate in any match, matches + the null string: NB this is different to some other regular expression + matchers. Back references are only available if the expression is compiled with + the flag regex_constants::bk_refs set. +
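The "(.*)\1" example can be tried with the sketch below, which assumes the standard library's <regex> (back references are always enabled in its default grammar, so no equivalent of bk_refs is needed). Only the case where the sub-expression participates in the match is exercised; is_doubled is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `text` is some string written twice in a row, such as "abcabc".
bool is_doubled(const std::string& text)
{
    static const std::regex e("(.*)\\1");
    std::smatch what;
    if (!std::regex_match(text, what, e))
        return false;
    // The back reference repeats what sub-expression 1 matched, so the
    // captured text is exactly half of the input.
    assert(what[1].str() + what[1].str() == text);
    return true;
}
```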

+

Characters by code +

+

This is an extension to the algorithm that is not available in other libraries. + It consists of the escape character followed by the digit "0" followed by the + octal character code. For example "\023" represents the character whose octal + code is 23. Where ambiguity could occur use parentheses to break the expression + up: "\0103" represents the character whose octal code is 103, "(\010)3" represents the + character whose octal code is 10 followed by "3". To match characters by their hexadecimal code, + use \x followed by a string of hexadecimal digits, optionally enclosed inside + {}, for example \xf0 or \x{aff}; notice the latter example is a Unicode + character.

+

Word operators +

+

The following operators are provided for compatibility with the GNU regular + expression library. +

+

"\w" matches any single character that is a member of the "word" character + class, this is identical to the expression "[[:word:]]". +

+

"\W" matches any single character that is not a member of the "word" character + class, this is identical to the expression "[^[:word:]]". +

+

"\<" matches the null string at the start of a word. +

+

"\>" matches the null string at the end of the word. +

+

"\b" matches the null string at either the start or the end of a word. +

+

"\B" matches a null string within a word. +

+

The start of the sequence passed to the matching algorithms is considered to be + a potential start of a word unless the flag match_not_bow is set. The end of + the sequence passed to the matching algorithms is considered to be a potential + end of a word unless the flag match_not_eow is set. +
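A short sketch of the word boundary operator "\b" follows. It is written against std::regex, which supports "\b", "\B", "\w" and "\W" but not the GNU "\<" and "\>" operators; the helper name count_whole_words is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts occurrences of `word` that stand alone as a word.
// Assumes `word` contains only word characters, so it needs no escaping.
std::size_t count_whole_words(const std::string& text, const std::string& word)
{
    const std::regex re("\\b" + word + "\\b");
    std::sregex_iterator first(text.begin(), text.end(), re), last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For the text "cat concat cat" and the word "cat" the count is 2: the occurrence inside "concat" is preceded by a word character, so there is no word boundary in front of it. The start and end of the text count as potential word boundaries, as described above.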

+

Buffer operators +

+

The following operators are provided for compatibility with the GNU regular + expression library, and Perl regular expressions: +

+

"\`" matches the start of a buffer. +

+

"\A" matches the start of the buffer. +

+

"\'" matches the end of a buffer. +

+

"\z" matches the end of a buffer. +

+

"\Z" matches the end of a buffer, or possibly one or more new line characters + followed by the end of the buffer. +

+

A buffer is considered to consist of the whole sequence passed to the matching + algorithms, unless the flags match_not_bob or match_not_eob are set. +

+

Escape operator +

+

The escape character "\" has several meanings. +

+

Inside a set declaration the escape character is a normal character unless the + flag regex_constants::escape_in_lists is set in which case whatever follows the + escape is a literal character regardless of its normal meaning. +

+

The escape operator may introduce an operator for example: back references, or + a word operator. +

+

The escape operator may make the following character normal, for example "\*" + represents a literal "*" rather than the repeat operator. +
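The difference between "*" as a repeat operator and "\*" as a literal can be sketched as below, using std::regex as a stand-in for Boost.Regex; the helper name full_match is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does `pattern` match the whole of `text`?
bool full_match(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex(pattern));
}
```

The pattern "a*b" matches "aaab" but not the three-character text "a*b", whereas the escaped pattern "a\*b" (written "a\\*b" in C++ source) matches only the literal text "a*b".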

+

Single character escape sequences +

+

The following escape sequences are aliases for single characters: +
+   +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
 Escape sequence + Character code + Meaning +  
 \a + 0x07 + Bell character. +  
 \f + 0x0C + Form feed. +  
 \n + 0x0A + Newline character. +  
 \r + 0x0D + Carriage return. +  
 \t + 0x09 + Tab character. +  
 \v + 0x0B + Vertical tab. +  
 \e + 0x1B + ASCII Escape character. +  
 \0dd + 0dd + An octal character code, where dd is one or + more octal digits. +  
 \xXX + 0xXX + A hexadecimal character code, where XX is one or more + hexadecimal digits. +  
 \x{XX} + 0xXX + A hexadecimal character code, where XX is one or more + hexadecimal digits, optionally a Unicode character. +  
 \cZ + Z-@ + An ASCII escape sequence control-Z, where Z is any + ASCII character greater than or equal to the character code for '@'. +  
+

+

Miscellaneous escape sequences: +

+

The following are provided mostly for perl compatibility, but note that there + are some differences in the meanings of \l \L \u and \U: +
+   +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
 \w + Equivalent to [[:word:]]. +  
 \W + Equivalent to [^[:word:]]. +  
 \s + Equivalent to [[:space:]]. +  
 \S + Equivalent to [^[:space:]]. +  
 \d + Equivalent to [[:digit:]]. +  
 \D + Equivalent to [^[:digit:]]. +  
 \l + Equivalent to [[:lower:]]. +  
 \L + Equivalent to [^[:lower:]]. +  
 \u + Equivalent to [[:upper:]]. +  
 \U + Equivalent to [^[:upper:]]. +  
 \C + Any single character, equivalent to '.'. +  
 \X + Match any Unicode combining character sequence, for + example "a\x{0301}" (a letter a with an acute). +  
 \Q + The begin quote operator, everything that follows is + treated as a literal character until a \E end quote operator is found. +  
 \E + The end quote operator, terminates a sequence begun + with \Q. +  
+
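The equivalence between a class escape and its bracketed form can be checked with a small sketch. It uses std::regex, which provides \w, \W, \s, \S, \d and \D; the \l, \L, \u, \U, \C, \X, \Q and \E escapes in the table are specific to this library and are not exercised here. Both helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is one or more digits, using the shorthand.
bool all_digits(const std::string& s)
{
    static const std::regex shorthand("\\d+");
    return std::regex_match(s, shorthand);
}

// Hypothetical helper: true when "\d" and "[[:digit:]]" agree about `s`.
bool digit_forms_agree(const std::string& s)
{
    static const std::regex bracketed("[[:digit:]]+");
    return all_digits(s) == std::regex_match(s, bracketed);
}
```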

+

What gets matched? +

+

+ When the expression is compiled as a Perl-compatible regex then the matching + algorithms will perform a depth first search on the state machine and report + the first match found.

+

+ When the expression is compiled as a POSIX-compatible regex, the matching algorithms will match the first possible matching string; if more than one string starting at a given location can match, then the longest possible string is matched, unless the flag match_any is set, in which case the first match encountered is returned. Use of the match_any option can reduce the time taken to find the match, but is only useful if the user is less concerned about what matched; for example, it would not be suitable for search and replace operations. In cases where there are multiple possible matches all starting at the same location, and all of the same length, the match chosen is the one with the longest first sub-expression; if that is the same for two or more matches, then the second sub-expression will be examined, and so on. +

+ The following examples illustrate the main differences between Perl and POSIX regular expression matching rules: +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Expression

+
+

Text

+
+

POSIX leftmost longest match

+
+

ECMAScript depth first search match

+
+

a|ab

+
+

+ xaby +

+
+

+ "ab"

+

+ "a"

+

+ .*([[:alnum:]]+).*

+

+ " abc def xyz "

+

$0 = " abc def xyz "
+ $1 = "abc"

+
+

$0 = " abc def xyz "
+ $1 = "z"

+
+

+ .*(a|xayy)

+

+ zzxayyzz

+

+ "zzxayy"

+

"zzxa"

+
+

These differences between Perl matching rules, and POSIX matching rules, mean + that these two regular expression syntaxes differ not only in the features + offered, but also in the form that the state machine takes and/or the + algorithms used to traverse the state machine.
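The first row of the table can be reproduced with a sketch along the following lines. It uses std::regex, selecting the ECMAScript grammar for the depth first rule and the POSIX extended grammar for the leftmost longest rule; that a given standard library implements the leftmost longest rule faithfully is an assumption here. The helper name first_match is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text of the first match of `pattern` in `text`
// under the given grammar, or an empty string if nothing matches.
std::string first_match(const std::string& text, const std::string& pattern,
                        std::regex::flag_type grammar)
{
    const std::regex re(pattern, grammar);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

Searching "xaby" for "a|ab" with the ECMAScript grammar reports "a", because the first alternative that succeeds is taken. With the extended grammar, a conforming implementation should report "ab", the longest match starting at the same position.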

+
+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998- 2003

+

Permission to use, copy, modify, distribute and sell this software + and its documentation for any purpose is hereby granted without fee, provided + that the above copyright notice appear in all copies and that both that + copyright notice and this permission notice appear in supporting documentation. + Dr John Maddock makes no representations about the suitability of this software + for any purpose. It is provided "as is" without express or implied warranty.

+ + + + diff --git a/doc/Attic/syntax_option_type.html b/doc/Attic/syntax_option_type.html new file mode 100644 index 00000000..532d6386 --- /dev/null +++ b/doc/Attic/syntax_option_type.html @@ -0,0 +1,332 @@ + + + + Boost.Regex: syntax_option_type + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

syntax_option_type

+
+

Boost.Regex Index

+
+

+
+

Synopsis

+

Type syntax_option_type is an implementation defined bitmask type that controls how a regular expression string is to be interpreted.  For convenience, note that all the constants listed here are also duplicated within the scope of class template basic_regex.

+
namespace std{ namespace regex_constants{
+
+typedef bitmask_type syntax_option_type;
+// these flags are standardized:
+static const syntax_option_type normal;
+static const syntax_option_type icase;
+static const syntax_option_type nosubs;
+static const syntax_option_type optimize;
+static const syntax_option_type collate;
+static const syntax_option_type ECMAScript = normal;
+static const syntax_option_type JavaScript = normal;
+static const syntax_option_type JScript = normal;
+static const syntax_option_type basic;
+static const syntax_option_type extended;
+static const syntax_option_type awk;
+static const syntax_option_type grep;
+static const syntax_option_type egrep;
+static const syntax_option_type sed = basic;
+static const syntax_option_type perl;
// these are boost.regex specific:
static const syntax_option_type escape_in_lists;
static const syntax_option_type char_classes;
static const syntax_option_type intervals;
static const syntax_option_type limited_ops;
static const syntax_option_type newline_alt;
static const syntax_option_type bk_plus_qm;
static const syntax_option_type bk_braces;
static const syntax_option_type bk_parens;
static const syntax_option_type bk_refs;
static const syntax_option_type bk_vbar;
static const syntax_option_type use_except;
static const syntax_option_type failbit;
static const syntax_option_type literal;
static const syntax_option_type nocollate;
static const syntax_option_type perlex;
static const syntax_option_type emacs;
+} // namespace regex_constants +} // namespace std
+

Description

+

The type syntax_option_type is an implementation defined bitmask + type (17.3.2.1.2). Setting its elements has the effects listed in the table + below, a valid value of type syntax_option_type will always have + exactly one of the elements normal, basic, extended, awk, grep, egrep, sed + or perl set.

+

Note that for convenience all the constants listed here are duplicated within + the scope of class template basic_regex, so you can use any of:

+
boost::regex_constants::constant_name
+

or

+
boost::regex::constant_name
+

or

+
boost::wregex::constant_name
+

in an interchangeable manner.
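As a sketch of how these constants are combined, the following selects exactly one grammar and adds a modifier flag with the bitwise or operator. It is written against std::regex, so the constants are spelled std::regex_constants::name and std::regex::name instead of the boost:: forms above; the helper name matches_ignoring_case is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: whole-string match of `pattern`, ignoring case.
bool matches_ignoring_case(const std::string& text, const std::string& pattern)
{
    // One grammar flag (ECMAScript) combined with one modifier flag (icase).
    // The namespace-scope and class-scope spellings name the same constants.
    const std::regex re(pattern,
                        std::regex_constants::ECMAScript | std::regex::icase);
    return std::regex_match(text, re);
}
```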

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Element

+
+

Effect if set

+
+

normal

+
+

Specifies that the grammar recognized by the regular expression engine uses its + normal semantics: that is the same as that given in the ECMA-262, ECMAScript + Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects + (FWD.1).

+

boost.regex also recognizes most perl-compatible extensions in this mode.

+
+

icase

+
+

Specifies that matching of regular expressions against a character container + sequence shall be performed without regard to case.

+
+

nosubs

+
+

Specifies that when a regular expression is matched against a character + container sequence, then no sub-expression matches are to be stored in the + supplied match_results structure.

+
+

optimize

+
+

Specifies that the regular expression engine should pay more attention to the + speed with which regular expressions are matched, and less to the speed with + which regular expression objects are constructed. Otherwise it has no + detectable effect on the program output.  This currently has no effect for + boost.regex.

+
+

collate

+
+

Specifies that character ranges of the form "[a-b]" should be locale sensitive.

+
+

ECMAScript

+
+

The same as normal.

+
+

JavaScript

+
+

The same as normal.

+
+

JScript

+
+

The same as normal.

+
+

basic

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX basic regular expressions in IEEE Std 1003.1-2001, + Portable Operating System Interface (POSIX ), Base Definitions and Headers, + Section 9, Regular Expressions (FWD.1). +

+
+

extended

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX extended regular expressions in IEEE Std + 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and + Headers, Section 9, Regular Expressions (FWD.1).

+
+

awk

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX utility awk in IEEE Std 1003.1-2001, Portable + Operating System Interface (POSIX ), Shells and Utilities, Section 4, awk + (FWD.1).

+

That is to say: the same as POSIX extended syntax, but with escape sequences in + character classes permitted.

+
+

grep

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX utility grep in IEEE Std 1003.1-2001, Portable + Operating System Interface (POSIX ), Shells and Utilities, Section 4, + Utilities, grep (FWD.1).

+

That is to say, the same as POSIX basic syntax, but with the newline character + acting as an alternation character in addition to "|".

+
+

egrep

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX utility grep when given the -E option in IEEE Std + 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and + Utilities, Section 4, Utilities, grep (FWD.1).

+

That is to say, the same as POSIX extended syntax, but with the newline + character acting as an alternation character in addition to "|".

+
+

sed

+
+

The same as basic.

+
+

perl

+
+

The same as normal.

+
+

+

The following constants are specific to this particular regular expression + implementation and do not appear in the + regular expression standardization proposal:

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
regbase::escape_in_lists: Allows the use of the escape "\" character in sets of characters, for example [\]] represents the set of characters containing only "]". If this flag is not set then "\" is an ordinary character inside sets.
regbase::char_classes: When this bit is set, character classes [:classname:] are allowed inside character set declarations, for example "[[:word:]]" represents the set of all characters that belong to the character class "word".
regbase::intervals: When this bit is set, repetition intervals are allowed, for example "a{2,4}" represents a repeat of between 2 and 4 letter a's.
regbase::limited_ops: When this bit is set, all of "+", "?" and "|" are ordinary characters in all situations.
regbase::newline_alt: When this bit is set, the newline character "\n" has the same effect as the alternation operator "|".
regbase::bk_plus_qm: When this bit is set, "\+" represents the one or more repetition operator and "\?" represents the zero or one repetition operator. When this bit is not set, "+" and "?" are used instead.
regbase::bk_braces: When this bit is set, "\{" and "\}" are used for bounded repetitions and "{" and "}" are normal characters. This is the opposite of default behavior.
regbase::bk_parens: When this bit is set, "\(" and "\)" are used to group sub-expressions and "(" and ")" are ordinary characters. This is the opposite of default behavior.
regbase::bk_refs: When this bit is set, back references are allowed.
regbase::bk_vbar: When this bit is set, "\|" represents the alternation operator and "|" is an ordinary character. This is the opposite of default behavior.
regbase::use_except: When this bit is set, a bad_expression exception will be thrown on error.  Use of this flag is deprecated: basic_regex will always throw on error.
regbase::failbit: This bit is set on error; if regbase::use_except is not set, then this bit should be checked to see if a regular expression is valid before use.
regbase::literal: All characters in the string are treated as literals; there are no special characters or escape sequences.
regbase::emacs: Provides compatibility with the emacs editor, equivalent to: bk_braces | bk_parens | bk_refs | bk_vbar.
+

+
+

Revised + + 17 May 2003 +

+

© Copyright John Maddock 1998- 2003

+


+ + + + diff --git a/doc/Attic/thread_safety.html b/doc/Attic/thread_safety.html new file mode 100644 index 00000000..eeda681d --- /dev/null +++ b/doc/Attic/thread_safety.html @@ -0,0 +1,68 @@ + + + + Boost.Regex: Thread Safety + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

Thread Safety

+
+

Boost.Regex Index

+
+

+
+

Class basic_regex<> and its typedefs regex + and wregex are thread safe, in that compiled regular expressions can safely be + shared between threads. The matching algorithms regex_match, + regex_search, regex_grep, + regex_format and regex_merge + are all re-entrant and thread safe. Class match_results + is now thread safe, in that the results of a match can be safely copied from + one thread to another (for example one thread may find matches and push + match_results instances onto a queue, while another thread pops them off the + other end), otherwise use a separate instance of match_results + per thread. +

+

The POSIX API functions are all re-entrant and + thread safe, regular expressions compiled with regcomp can also be + shared between threads. +

+

The class RegEx is only thread safe if each thread + gets its own RegEx instance (apartment threading) - this is a consequence of + RegEx handling both compiling and matching regular expressions. +

+

Finally note that changing the global locale invalidates all compiled regular + expressions, therefore calling set_locale from one thread while another + uses regular expressions will produce unpredictable results. +

+

+ There is also a requirement that there is only one thread executing prior to + the start of main().

+
+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998- 2003

+


+ + + diff --git a/doc/Attic/uarrow.gif b/doc/Attic/uarrow.gif new file mode 100644 index 00000000..6afd20c3 Binary files /dev/null and b/doc/Attic/uarrow.gif differ diff --git a/doc/standards.html b/doc/standards.html new file mode 100644 index 00000000..35a2e67e --- /dev/null +++ b/doc/standards.html @@ -0,0 +1,79 @@ + + + + Boost.Regex: Standards Conformance + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

Standards Conformance

+
+

Boost.Regex Index

+
+

+
+

C++

+

Boost.regex is intended to conform to the + regular expression standardization proposal, which will appear in a + future C++ standard technical report (and hopefully in a future version of the + standard).  Currently there are some differences in how the regular + expression traits classes are defined, these will be fixed in a future release.

+

ECMAScript / JavaScript

+

All of the ECMAScript regular expression syntax features are supported, except + that:

+

Negated class escapes (\S, \D and \W) are not permitted inside character class + definitions ( [...] ).

+

The escape sequence \u matches any upper case character (the same as + [[:upper:]]) rather than a Unicode escape sequence; use \x{DDDD} for + Unicode escape sequences.

+

Perl

+

Almost all Perl features are supported, except for:

+

\N{name}  Use [[:name:]] instead.

+

\pP and \PP

+

(?imsx-imsx)

+

(?<=pattern)

+

(?<!pattern)

+

(?{code})

+

(??{code})

+

(?(condition)yes-pattern) and (?(condition)yes-pattern|no-pattern)

+

These embarrassments / limitations will be removed in due course, mainly + dependent upon user demand.

+

POSIX

+

All the POSIX basic and extended regular expression features are supported, + except that:

+

No character collating names are recognized except those specified in the POSIX + standard for the C locale, unless they are explicitly registered with the + traits class.

+

Character equivalence classes ( [[=a=]] etc) are probably buggy except on + Win32.  Implementing this feature requires knowledge of the format of the + string sort keys produced by the system; if you need this, and the default + implementation doesn't work on your platform, then you will need to supply a + custom traits class.

+
+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998- 2003

+


+ + + + diff --git a/doc/sub_match.html b/doc/sub_match.html new file mode 100644 index 00000000..db995312 --- /dev/null +++ b/doc/sub_match.html @@ -0,0 +1,426 @@ + + + + Boost.Regex: sub_match + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

sub_match

+
+

Boost.Regex Index

+
+

+
+

Synopsis

+

#include <boost/regex.hpp> +

+

Regular expressions are different from many simple pattern-matching algorithms in that, as well as finding an overall match, they can also produce sub-expression matches: each sub-expression is delimited in the pattern by a pair of parentheses (...). There has to be some method for reporting sub-expression matches back to the user: this is achieved by defining a class match_results that acts as an indexed collection of sub-expression matches, each sub-expression match being contained in an object of type sub_match. +

Objects of type sub_match may only be obtained by subscripting an object of type match_results. +

When the marked sub-expression denoted by an object of type sub_match<> participated in a regular expression match then member matched evaluates to true, and members first and second denote the range of characters [first,second) which formed that match. Otherwise matched is false, and members first and second contain undefined values.

+

If an object of type sub_match<> represents sub-expression 0 + - that is to say the whole match - then member matched is always + true, unless a partial match was obtained as a result of the flag match_partial + being passed to a regular expression algorithm, in which case member matched + is false, and members first and second represent the + character range that formed the partial match.
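The role of the matched member can be sketched as follows. The example uses std::smatch and std::regex, whose sub_match follows the same design as the class described here; the helper name describe and the pattern are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports which alternative of "(a+)|(b+)" matched `s`.
std::string describe(const std::string& s)
{
    static const std::regex re("(a+)|(b+)");
    std::smatch m;
    if (!std::regex_match(s, m, re))
        return "no match";
    // m[0] is the whole match; m[1] and m[2] are the two sub-expressions.
    // Only the one that took part has matched == true and a meaningful
    // [first, second) range, so matched is tested before the range is used.
    if (m[1].matched)
        return "a-run:" + std::string(m[1].first, m[1].second);
    return "b-run:" + std::string(m[2].first, m[2].second);
}
```

For the input "bb" the first sub-expression does not participate, so m[1].matched is false and only m[2] carries a usable range.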

+
+namespace boost{
+      
+template <class BidirectionalIterator>
+class sub_match : public std::pair<BidirectionalIterator, BidirectionalIterator>
+{
+public:
+   typedef typename iterator_traits<BidirectionalIterator>::value_type       value_type;
+   typedef typename iterator_traits<BidirectionalIterator>::difference_type  difference_type;
+   typedef          BidirectionalIterator                                    iterator;
+
+   bool matched;
+
+   difference_type length()const;
+   operator basic_string<value_type>()const;
+   basic_string<value_type> str()const;
+
+   int compare(const sub_match& s)const;
+   int compare(const basic_string<value_type>& s)const;
+   int compare(const value_type* s)const;
+};
+
+template <class BidirectionalIterator>
+bool operator == (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator != (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator < (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator <= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator >= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator>
+bool operator > (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+
+
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator == (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator != (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator < (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator > (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator >= (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator <= (const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs,
+                 const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs,
+                 const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+template <class BidirectionalIterator, class traits, class Allocator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs,
+                  const std::basic_string<iterator_traits<BidirectionalIterator>::value_type, traits, Allocator>& rhs);
+
+template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+
+template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+
+template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+
+template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+
+template <class charT, class traits, class BidirectionalIterator>
+basic_ostream<charT, traits>&
+   operator << (basic_ostream<charT, traits>& os,
+                const sub_match<BidirectionalIterator>& m);
+
+} // namespace boost
+

Description

+

+ sub_match members

+
typedef typename std::iterator_traits<iterator>::value_type value_type;
+

The type pointed to by the iterators.

+
typedef typename std::iterator_traits<iterator>::difference_type difference_type;
+

A type that represents the difference between two iterators.

+
typedef iterator iterator_type;
+

The iterator type.

+
iterator first
+

An iterator denoting the position of the start of the match.

+
iterator second
+

An iterator denoting the position of the end of the match.

+
bool matched
+

A Boolean value denoting whether this sub-expression participated in the match.

+
difference_type length()const;
+ +

+ Effects: returns (matched ? distance(first, second) : 0).

operator basic_string<value_type>()const;
+ +

+ Effects: returns (matched ? basic_string<value_type>(first, + second) : basic_string<value_type>()).

basic_string<value_type> str()const;
+ +

+ Effects: returns (matched ? basic_string<value_type>(first, + second) : basic_string<value_type>()).

int compare(const sub_match& s)const;
+ +

+ Effects: returns str().compare(s.str()).

int compare(const basic_string<value_type>& s)const;
+ +

+ Effects: returns str().compare(s).

int compare(const value_type* s)const;
+ +

+ Effects: returns str().compare(s).
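As a rough sketch of how these members behave in practice, the following uses std::sub_match from the standard <regex> header, on the assumption that it mirrors the sub_match interface described above (matched, length(), str(), compare()). The helper name describe_capture, the struct capture_info and the pattern are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// What the sub_match members report for one capture group.
struct capture_info {
    bool matched;          // sub_match::matched
    std::ptrdiff_t length; // sub_match::length()
    std::string text;      // sub_match::str()
    int self_compare;      // sub_match::compare() against its own text
};

// Matches `text` against "(\d+)-([a-z]+)?" and describes capture `index`.
// Hypothetical helper; std::regex stands in for boost::regex here.
capture_info describe_capture(const std::string& text, std::size_t index) {
    static const std::regex re("(\\d+)-([a-z]+)?");
    capture_info info{false, 0, std::string(), 0};
    std::smatch m;
    if (std::regex_match(text, m, re)) {
        const std::ssub_match& sub = m[index];
        info.matched = sub.matched;
        info.length = sub.length();   // 0 when the group did not participate
        info.text = sub.str();        // empty when the group did not participate
        info.self_compare = sub.compare(sub.str());
    }
    return info;
}
```

For the input "100-", sub-expression 2 takes no part in the match, so matched is false, length() is 0 and str() is empty.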

+

+ sub_match non-member operators

+
template <class BidirectionalIterator>
+bool operator == (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) == 0.

template <class BidirectionalIterator>
+bool operator != (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) != 0.

template <class BidirectionalIterator>
+bool operator < (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) < 0.

template <class BidirectionalIterator>
+bool operator <= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) <= 0.

template <class BidirectionalIterator>
+bool operator >= (const sub_match<BidirectionalIterator>& lhs,
+                  const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) >= 0.

template <class BidirectionalIterator>
+bool operator > (const sub_match<BidirectionalIterator>& lhs,
+                 const sub_match<BidirectionalIterator>& rhs);
+ +

+ Effects: returns lhs.compare(rhs) > 0.

template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs == rhs.str().

template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs != rhs.str().

template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs < rhs.str().

template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs > rhs.str().

template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs >= rhs.str().

template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const* lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs <= rhs.str().

template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() == rhs.

template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() != rhs.

template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() < rhs.

template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() > rhs.

template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() >= rhs.

template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const* rhs); 
+ +

+ Effects: returns lhs.str() <= rhs.

template <class BidirectionalIterator> 
+bool operator == (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs == rhs.str().

template <class BidirectionalIterator> 
+bool operator != (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs != rhs.str().

template <class BidirectionalIterator> 
+bool operator < (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs < rhs.str().

template <class BidirectionalIterator> 
+bool operator > (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                 const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs > rhs.str().

template <class BidirectionalIterator> 
+bool operator >= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs >= rhs.str().

template <class BidirectionalIterator> 
+bool operator <= (typename iterator_traits<BidirectionalIterator>::value_type const& lhs, 
+                  const sub_match<BidirectionalIterator>& rhs); 
+ +

+ Effects: returns lhs <= rhs.str().

template <class BidirectionalIterator> 
+bool operator == (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() == rhs.

template <class BidirectionalIterator> 
+bool operator != (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() != rhs.

template <class BidirectionalIterator> 
+bool operator < (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() < rhs.

template <class BidirectionalIterator> 
+bool operator > (const sub_match<BidirectionalIterator>& lhs, 
+                 typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() > rhs.

template <class BidirectionalIterator> 
+bool operator >= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() >= rhs.

template <class BidirectionalIterator> 
+bool operator <= (const sub_match<BidirectionalIterator>& lhs, 
+                  typename iterator_traits<BidirectionalIterator>::value_type const& rhs); 
+ +

+ Effects: returns lhs.str() <= rhs.

template <class charT, class traits, class BidirectionalIterator>
+basic_ostream<charT, traits>&
+   operator << (basic_ostream<charT, traits>& os,
+                const sub_match<BidirectionalIterator>& m);
+ +

+ Effects: returns (os << m.str()). +
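The operators above can be illustrated with a minimal sketch. It assumes std::sub_match from the standard <regex> header as a stand-in for the class documented here; the helper names first_value and value_is, and the "key=" pattern, are hypothetical.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Streams the first capture of "key=(\w+)" with operator<<.
// Returns "<none>" when the pattern is not found. Hypothetical helper.
std::string first_value(const std::string& line) {
    static const std::regex re("key=(\\w+)");
    std::smatch m;
    if (!std::regex_search(line, m, re)) return "<none>";
    std::ostringstream os;
    os << m[1];                       // equivalent to os << m[1].str()
    return os.str();
}

// Compares the capture with a null-terminated string on either side,
// exercising the (sub_match, const value_type*) operator overloads.
bool value_is(const std::string& line, const char* expected) {
    static const std::regex re("key=(\\w+)");
    std::smatch m;
    return std::regex_search(line, m, re)
        && m[1] == expected && expected == m[1];
}
```

Because every comparison is defined in terms of str(), a sub_match that did not participate in the match compares equal to an empty string.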


+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998- 2003

+

Permission to use, copy, modify, distribute and sell this software + and its documentation for any purpose is hereby granted without fee, provided + that the above copyright notice appear in all copies and that both that + copyright notice and this permission notice appear in supporting documentation. + Dr John Maddock makes no representations about the suitability of this software + for any purpose. It is provided "as is" without express or implied warranty.

+ + + + diff --git a/doc/syntax.html b/doc/syntax.html new file mode 100644 index 00000000..f776cd3c --- /dev/null +++ b/doc/syntax.html @@ -0,0 +1,773 @@ + + + + Boost.Regex: Regular Expression Syntax + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

Regular Expression Syntax

+
+


+
+

+
+

This section covers the regular expression syntax used by this library. It is a programmer's guide: the actual syntax presented to your program's users will depend upon the flags used during expression compilation.

+

Literals +

+

All characters are literals except: ".", "|", "*", "?", "+", "(", ")", "{", + "}", "[", "]", "^", "$" and "\". These characters are literals when preceded by + a "\". A literal is a character that matches itself, or matches the result of + traits_type::translate(), where traits_type is the traits template parameter to + class basic_regex.
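One practical consequence is that arbitrary text can be turned into a pattern that matches itself by escaping the characters listed above. The following is a minimal sketch under the assumption that std::regex (default ECMAScript grammar) treats a backslash before these characters the same way; escape_literal and matches_literally are hypothetical helper names, not part of any library.

```cpp
#include <regex>
#include <string>

// Prefixes every special character listed above with "\" so that the
// result matches `text` literally. Hypothetical helper name.
std::string escape_literal(const std::string& text) {
    static const std::string specials = ".|*?+(){}[]^$\\";
    std::string out;
    for (char c : text) {
        if (specials.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// True when `candidate` is exactly the literal string `text`.
bool matches_literally(const std::string& text, const std::string& candidate) {
    return std::regex_match(candidate, std::regex(escape_literal(text)));
}
```

Without the escaping step, the "." in "a.b" would match any character and the "*" would act as a repeat.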

+

Wildcard +

+

The dot character "." matches any single character, with two exceptions: when match_not_dot_null is passed to the matching algorithms, the dot does not match a null character; when match_not_dot_newline is passed to the matching algorithms, the dot does not match a newline character.

+

Repeats +

+

A repeat is an expression that is repeated an arbitrary number of times. An expression followed by "*" can be repeated any number of times, including zero. An expression followed by "+" can be repeated any number of times, but at least once; if the expression is compiled with the flag regex_constants::bk_plus_qm then "+" is an ordinary character and "\+" represents a repeat of once or more. An expression followed by "?" may be repeated zero or one times only; if the expression is compiled with the flag regex_constants::bk_plus_qm then "?" is an ordinary character and "\?" represents the repeat zero or once operator. When it is necessary to specify the minimum and maximum number of repeats explicitly, the bounds operator "{}" may be used: "a{2}" is the letter "a" repeated exactly twice, "a{2,4}" represents the letter "a" repeated between 2 and 4 times, and "a{2,}" represents the letter "a" repeated at least twice with no upper limit. Note that there must be no white-space inside the {}, and there is no upper limit on the values of the lower and upper bounds. When the expression is compiled with the flag regex_constants::bk_braces then "{" and "}" are ordinary characters and "\{" and "\}" are used to delimit bounds instead. All repeat expressions refer to the shortest possible previous sub-expression: a single character, a character set, or a sub-expression grouped with "()", for example.

+

Examples: +

+

"ba*" will match all of "b", "ba", "baaa" etc. +

+

"ba+" will match "ba" or "baaaa" for example but not "b". +

+

"ba?" will match "b" or "ba". +

+

"ba{2,4}" will match "baa", "baaa" and "baaaa". +

+
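The repeat examples above can be checked with a short sketch. It assumes std::regex with its default grammar accepts the same "*", "+", "?" and "{}" operators; matches_whole is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// True when the whole of `text` is matched by `pattern`.
// Hypothetical helper; std::regex stands in for boost::regex.
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

For instance matches_whole("ba+", "b") is false because "+" requires at least one "a", while matches_whole("ba*", "b") is true.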

Non-greedy repeats +

+

Whenever the "extended" regular expression syntax is in use (the default) then + non-greedy repeats are possible by appending a '?' after the repeat; a + non-greedy repeat is one which will match the shortest possible string. +

+

For example to match html tag pairs one could use something like: +

+

"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>" +

+

In this case $1 will contain the text between the tag pairs, and will be the + shortest possible matching string.  +
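To make the difference from a greedy repeat concrete, here is a minimal sketch that runs the tag-pair expression with "b" substituted for tagname. It assumes std::regex (ECMAScript grammar) handles "*?" the same way; between_tags is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Returns $1 for the first <b>...</b> pair in `html`, using either a
// greedy "(.*)" or a non-greedy "(.*?)" repeat. Hypothetical helper.
std::string between_tags(const std::string& html, bool greedy) {
    const std::regex re(greedy ? R"(<\s*b[^>]*>(.*)<\s*/b\s*>)"
                               : R"(<\s*b[^>]*>(.*?)<\s*/b\s*>)");
    std::smatch m;
    return std::regex_search(html, m, re) ? m[1].str() : std::string();
}
```

On the text "<b>one</b> and <b>two</b>" the non-greedy form stops at the first closing tag, whereas the greedy form runs on to the last one.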

+

Parentheses

+

Parentheses serve two purposes: to group items together into a sub-expression, and to mark what generated the match. For example the expression "(ab)*" would match all of the string "ababab". The matching algorithms regex_match and regex_search each take an instance of match_results that reports what caused the match; on exit from these functions the match_results contains information both on what the whole expression matched and on what each sub-expression matched. In the example above match_results[1] would contain a pair of iterators denoting the final "ab" of the matching string. It is permissible for sub-expressions to match null strings. If a sub-expression takes no part in a match (for example if it is part of an alternative that is not taken) then both of the iterators that are returned for that sub-expression point to the end of the input string, and the matched member for that sub-expression is false. Sub-expressions are indexed from left to right starting from 1; sub-expression 0 is the whole expression.
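The two behaviours described above (a repeated group reporting its final iteration, and an untaken alternative reporting matched as false) can be sketched as follows. This assumes std::regex and std::smatch behave like the classes documented here; last_repeat and participation are hypothetical helper names.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Offset and text of what sub-expression 1 of "(ab)*" finally captured.
// Returns {-1, ""} when `text` does not match. Hypothetical helper.
std::pair<std::ptrdiff_t, std::string> last_repeat(const std::string& text) {
    static const std::regex re("(ab)*");
    std::smatch m;
    if (!std::regex_match(text, m, re)) return {-1, std::string()};
    return {m.position(1), m[1].str()};
}

// Reports which of the two alternatives in "(a)|(b)" took part in the match.
std::pair<bool, bool> participation(const std::string& text) {
    static const std::regex re("(a)|(b)");
    std::smatch m;
    if (!std::regex_match(text, m, re)) return {false, false};
    return {m[1].matched, m[2].matched};
}
```

For "ababab" the capture is the final "ab", starting at offset 4; for "b" only the second sub-expression participates.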

+

Non-Marking Parentheses

+

Sometimes you need to group sub-expressions with parentheses, but don't want the parentheses to produce another marked sub-expression; in this case a non-marking parenthesis (?:expression) can be used. For example the following expression creates no sub-expressions:

+

"(?:abc)*"

+
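One way to confirm that a group is non-marking is to ask the compiled expression how many marked sub-expressions it holds. The sketch below assumes std::regex, whose mark_count() member reports that number; marked_groups is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Number of marked sub-expressions in `pattern`, not counting the
// whole match. Hypothetical helper; std::regex stands in for boost::regex.
unsigned marked_groups(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}
```

Here marked_groups("(?:abc)*") is 0 while marked_groups("(abc)*") is 1.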

Forward Lookahead Asserts  +

+

There are two forms of these; one for positive forward lookahead asserts, and + one for negative lookahead asserts:

+

"(?=abc)" matches zero characters only if they are followed by the expression + "abc".

+

"(?!abc)" matches zero characters only if they are not followed by the + expression "abc".

+
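A short sketch of both assert forms follows. It assumes std::regex (ECMAScript grammar) supports "(?=...)" and "(?!...)" with the meaning given above; search_text and the "foo"/"bar" patterns are hypothetical.

```cpp
#include <regex>
#include <string>

// Text matched by the first occurrence of `pattern` in `text`, or
// "<no match>" when there is none. Hypothetical helper.
std::string search_text(const std::string& pattern, const std::string& text) {
    std::smatch m;
    return std::regex_search(text, m, std::regex(pattern))
        ? m.str() : std::string("<no match>");
}
```

Note that with "foo(?=bar)" the reported match is only "foo": the assert checks what follows without consuming it.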

Independent sub-expressions

+

"(?>expression)" matches "expression" as an independent atom (the algorithm + will not backtrack into it if a failure occurs later in the expression).

+

Alternatives +

+

Alternatives occur when the expression can match either one sub-expression or another. Each alternative is separated by a "|", or a "\|" if the flag regex_constants::bk_vbar is set, or by a newline character if the flag regex_constants::newline_alt is set. Each alternative is the largest possible previous sub-expression; this is the opposite behavior from repetition operators.

+

Examples: +

+

"a(b|c)" could match "ab" or "ac". +

+

"abc|def" could match "abc" or "def". +

+
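The point that an alternative spans the largest possible sub-expression can be seen in a small sketch. It assumes std::regex with the default "|" separator; matches_alternative is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// True when the whole of `text` is matched by `pattern`.
// Hypothetical helper; std::regex stands in for boost::regex.
bool matches_alternative(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

"abc|def" means "abc" or "def", so it does not match "abdef"; to alternate only the middle character, group it as in "ab(c|d)ef".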

Sets +

+

A set is a set of characters that can match any single character that is a member of the set. Sets are delimited by "[" and "]" and can contain literals, character ranges, character classes, collating elements and equivalence classes. Set declarations that start with "^" contain the complement of the elements that follow.

+

Examples: +

+

Character literals: +

+

"[abc]" will match either of "a", "b", or "c". +

+

"[^abc]" will match any character other than "a", "b", or "c".

+

Character ranges: +

+

"[a-z]" will match any character in the range "a" to "z". +

+

"[^A-Z]" will match any character other than those in the range "A" to "Z". +

+
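The set examples above can be exercised one character at a time. This is a minimal sketch assuming std::regex in the default "C" locale without the collate flag, so ranges follow character codes; keep_matching is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Keeps only the characters of `text` accepted by the one-character
// set expression `set`. Hypothetical helper.
std::string keep_matching(const std::string& set, const std::string& text) {
    const std::regex re(set);
    std::string out;
    for (char c : text) {
        if (std::regex_match(std::string(1, c), re)) out += c;
    }
    return out;
}
```

For example keep_matching("[^A-Z]", "aZb9") drops only the upper case letter.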

Note that character ranges are highly locale dependent if the flag regex_constants::collate is set: they match any character that collates between the endpoints of the range, and ranges will only behave according to ASCII rules when the default "C" locale is in effect. For example if the library is compiled with the Win32 localization model, then [a-z] will match the ASCII characters a-z, and also 'A', 'B' etc, but not 'Z' which collates just after 'z'. This locale specific behavior is disabled by default (in perl mode), in which case ranges collate according to ASCII character code.

+

Character classes are denoted using the syntax "[:classname:]" within a set + declaration, for example "[[:space:]]" is the set of all whitespace characters. + Character classes are only available if the flag regex_constants::char_classes + is set. The available character classes are: +
+   +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
alnum: Any alphanumeric character.
alpha: Any alphabetical character a-z and A-Z. Other characters may also be included depending upon the locale.
blank: Any blank character, either a space or a tab.
cntrl: Any control character.
digit: Any digit 0-9.
graph: Any graphical character.
lower: Any lower case character a-z. Other characters may also be included depending upon the locale.
print: Any printable character.
punct: Any punctuation character.
space: Any whitespace character.
upper: Any upper case character A-Z. Other characters may also be included depending upon the locale.
xdigit: Any hexadecimal digit character, 0-9, a-f and A-F.
word: Any word character: all alphanumeric characters plus the underscore.
Unicode: Any character whose code is greater than 255; this applies to the wide character traits classes only.
+

+

There are some shortcuts that can be used in place of the character classes. Provided the flag regex_constants::escape_in_lists is set, you can use:

+

\w in place of [:word:] +

+

\s in place of [:space:] +

+

\d in place of [:digit:] +

+

\l in place of [:lower:] +

+

\u in place of [:upper:]  +
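As an illustration of the equivalence between a class name and its shortcut, the sketch below replaces every matching character with "#". It assumes std::regex (ECMAScript grammar), which provides \w, \s and \d; the \l and \u shortcuts are not exercised here. The helper name mask is hypothetical.

```cpp
#include <regex>
#include <string>

// Replaces every character of `text` matched by `pattern` with '#'.
// Hypothetical helper; std::regex stands in for boost::regex.
std::string mask(const std::string& pattern, const std::string& text) {
    return std::regex_replace(text, std::regex(pattern), "#");
}
```

Under these assumptions mask("[[:digit:]]", s) and mask("\\d", s) give the same result for any s, and likewise for [[:space:]] and \s.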

+

Collating elements take the general form [.tagname.] inside a set declaration, + where tagname is either a single character, or a name of a collating + element, for example [[.a.]] is equivalent to [a], and [[.comma.]] is + equivalent to [,]. The library supports all the standard POSIX collating + element names, and in addition the following digraphs: "ae", "ch", "ll", "ss", + "nj", "dz", "lj", each in lower, upper and title case variations. + Multi-character collating elements can result in the set matching more than one + character, for example [[.ae.]] would match two characters, but note that + [^[.ae.]] would only match one character.  +

+

Equivalence classes take the general form [=tagname=] inside a set declaration, where tagname is either a single character, or a name of a collating element, and match any character that is a member of the same primary equivalence class as the collating element [.tagname.]. An equivalence class is a set of characters that collate the same; a primary equivalence class is a set of characters whose primary sort keys are all the same (for example strings are typically collated by character, then by accent, and then by case; the primary sort key then relates to the character, the secondary to the accentation, and the tertiary to the case). If there is no equivalence class corresponding to tagname, then [=tagname=] is exactly the same as [.tagname.]. Unfortunately there is no locale independent method of obtaining the primary sort key for a character, except under Win32. For other operating systems the library will "guess" the primary sort key from the full sort key (obtained from strxfrm), so equivalence classes are probably best considered broken under any operating system other than Win32.

+

To include a literal "-" in a set declaration, make it the first character after the opening "[" or "[^", the endpoint of a range, or a collating element; alternatively, if the flag regex_constants::escape_in_lists is set, precede it with an escape character as in "[\-]". To include a literal "[", "]" or "^" in a set, make it the endpoint of a range or a collating element, or precede it with an escape character if the flag regex_constants::escape_in_lists is set.

+

Line anchors +

+

An anchor is something that matches the null string at the start or end of a + line: "^" matches the null string at the start of a line, "$" matches the null + string at the end of a line. +

+

Back references +

+

A back reference is a reference to a previous sub-expression that has already been matched; the reference is to what the sub-expression matched, not to the expression itself. A back reference consists of the escape character "\" followed by a digit "1" to "9": "\1" refers to the first sub-expression, "\2" to the second, etc. For example the expression "(.*)\1" matches any string that is repeated about its mid-point, for example "abcabc" or "xyzxyz". A back reference to a sub-expression that did not participate in any match matches the null string; note that this is different to some other regular expression matchers. Back references are only available if the expression is compiled with the flag regex_constants::bk_refs set.
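The "(.*)\1" example can be tried with a minimal sketch, assuming std::regex (ECMAScript grammar), where back references are available by default. The helper names is_doubled and repeated_half are hypothetical.

```cpp
#include <regex>
#include <string>

// True when `text` consists of some string followed by itself.
bool is_doubled(const std::string& text) {
    static const std::regex re("(.*)\\1");
    return std::regex_match(text, re);
}

// The repeated half of `text` when is_doubled(text) holds, else "".
std::string repeated_half(const std::string& text) {
    static const std::regex re("(.*)\\1");
    std::smatch m;
    return std::regex_match(text, m, re) ? m[1].str() : std::string();
}
```

The back reference compares against the text captured by the first sub-expression, so "abcabd" is rejected even though both halves would match ".*" on their own.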

+

Characters by code +

+

This is an extension to the algorithm that is not available in other libraries. It consists of the escape character followed by the digit "0" followed by the octal character code. For example "\023" represents the character whose octal code is 23. Where ambiguity could occur use parentheses to break the expression up: "\0103" represents the character whose code is 103, "(\010)3" represents the character 10 followed by "3". To match characters by their hexadecimal code, use \x followed by a string of hexadecimal digits, optionally enclosed inside {}, for example \xf0 or \x{aff}; notice the latter example is a Unicode character.

+

Word operators +

+

The following operators are provided for compatibility with the GNU regular + expression library. +

+

"\w" matches any single character that is a member of the "word" character + class, this is identical to the expression "[[:word:]]". +

+

"\W" matches any single character that is not a member of the "word" character + class, this is identical to the expression "[^[:word:]]". +

+

"\<" matches the null string at the start of a word. +

+

"\>" matches the null string at the end of the word. +

+

"\b" matches the null string at either the start or the end of a word. +

+

"\B" matches a null string within a word. +

+

The start of the sequence passed to the matching algorithms is considered to be + a potential start of a word unless the flag match_not_bow is set. The end of + the sequence passed to the matching algorithms is considered to be a potential + end of a word unless the flag match_not_eow is set. +
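A small sketch of the word-boundary operators follows. It assumes std::regex (ECMAScript grammar), which provides \b, \B, \w and \W; the GNU-style "\<" and "\>" operators are not exercised here. The helper name count_matches is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Number of non-overlapping matches of `pattern` in `text`.
// Hypothetical helper; std::regex stands in for boost::regex.
std::size_t count_matches(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);
    std::sregex_iterator first(text.begin(), text.end(), re);
    std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

In "cat concat cat." the plain pattern "cat" is found three times, "\bcat\b" only at the two stand-alone words, and "\Bcat" only inside "concat".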

+

Buffer operators +

+

The following operators are provided for compatibility with the GNU regular + expression library, and Perl regular expressions: +

+

"\`" matches the start of a buffer. +

+

"\A" matches the start of the buffer. +

+

"\'" matches the end of a buffer. +

+

"\z" matches the end of a buffer. +

+

"\Z" matches the end of a buffer, or possibly one or more new line characters + followed by the end of the buffer. +

+

A buffer is considered to consist of the whole sequence passed to the matching + algorithms, unless the flags match_not_bob or match_not_eob are set. +

+

Escape operator +

+

The escape character "\" has several meanings. +

+

Inside a set declaration the escape character is a normal character unless the + flag regex_constants::escape_in_lists is set in which case whatever follows the + escape is a literal character regardless of its normal meaning. +

+

The escape operator may introduce an operator for example: back references, or + a word operator. +

+

The escape operator may make the following character normal, for example "\*" + represents a literal "*" rather than the repeat operator. +

+

Single character escape sequences +

+

The following escape sequences are aliases for single characters: +
+   +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
 Escape sequence + Character code + Meaning +  
 \a + 0x07 + Bell character. +  
 \f + 0x0C + Form feed. +  
 \n + 0x0A + Newline character. +  
 \r + 0x0D + Carriage return. +  
 \t + 0x09 + Tab character. +  
 \v + 0x0B + Vertical tab. +  
 \e + 0x1B + ASCII Escape character. +  
 \0dd + 0dd + An octal character code, where dd is one or + more octal digits. +  
 \xXX + 0xXX + A hexadecimal character code, where XX is one or more + hexadecimal digits. +  
 \x{XX} + 0xXX + A hexadecimal character code, where XX is one or more + hexadecimal digits, optionally a Unicode character. +  
 \cZ + Z-@ + An ASCII escape sequence control-Z, where Z is any ASCII character greater than or equal to the character code for '@'. + 
+

+

Miscellaneous escape sequences: +

+

The following are provided mostly for perl compatibility, but note that there + are some differences in the meanings of \l \L \u and \U: +
+   +

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
 \w + Equivalent to [[:word:]]. +  
 \W + Equivalent to [^[:word:]]. +  
 \s + Equivalent to [[:space:]]. +  
 \S + Equivalent to [^[:space:]]. +  
 \d + Equivalent to [[:digit:]]. +  
 \D + Equivalent to [^[:digit:]]. +  
 \l + Equivalent to [[:lower:]]. +  
 \L + Equivalent to [^[:lower:]]. +  
 \u + Equivalent to [[:upper:]]. +  
 \U + Equivalent to [^[:upper:]]. +  
 \C + Any single character, equivalent to '.'. +  
 \X + Match any Unicode combining character sequence, for + example "a\x 0301" (a letter a with an acute). +  
 \Q + The begin quote operator, everything that follows is + treated as a literal character until a \E end quote operator is found. +  
 \E + The end quote operator, terminates a sequence begun + with \Q. +  
+

+

What gets matched? +

+

+ When the expression is compiled as a Perl-compatible regex then the matching + algorithms will perform a depth first search on the state machine and report + the first match found.

+

When the expression is compiled as a POSIX-compatible regex then the matching algorithms will match the first possible matching string. If more than one string starting at a given location can match then it matches the longest possible string, unless the flag match_any is set, in which case the first match encountered is returned. Use of the match_any option can reduce the time taken to find the match, but is only useful if the user is less concerned about what matched; for example it would not be suitable for search and replace operations. In cases where there are multiple possible matches all starting at the same location, and all of the same length, then the match chosen is the one with the longest first sub-expression; if that is the same for two or more matches, then the second sub-expression will be examined and so on.

The following examples illustrate the main differences between Perl and POSIX regular expression matching rules:

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Expression

+
+

Text

+
+

POSIX leftmost longest match

+
+

ECMAScript depth first search match

+
+

a|ab

+
+

+ xaby +

+
+

+ "ab"

+

+ "a"

+

+ .*([[:alnum:]]+).*

+

+ " abc def xyz "

+

$0 = " abc def xyz "
+ $1 = "abc"

+
+

$0 = " abc def xyz "
+ $1 = "z"

+
+

+ .*(a|xayy)

+

+ zzxayyzz

+

+ "zzxayy"

+

"zzxa"

+
+

These differences between Perl matching rules, and POSIX matching rules, mean + that these two regular expression syntaxes differ not only in the features + offered, but also in the form that the state machine takes and/or the + algorithms used to traverse the state machine.
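The depth first search column of the table can be reproduced with a minimal sketch. It assumes std::regex with its default ECMAScript grammar follows the same depth first rules as the Perl-compatible mode described here; the POSIX leftmost longest column is not exercised. The helper name first_match is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>

// Returns ($0, $1) for the first match of `pattern` in `text`;
// $1 is empty when the pattern has no marked sub-expression.
// Hypothetical helper; std::regex stands in for boost::regex.
std::pair<std::string, std::string>
first_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);   // default grammar: ECMAScript
    std::smatch m;
    if (!std::regex_search(text, m, re)) return {};
    return {m[0].str(), m.size() > 1 ? m[1].str() : std::string()};
}
```

With "a|ab" against "xaby" the first alternative succeeds immediately, so the reported match is "a" rather than the longer "ab".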

+
+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998- 2003

+


+ + + + diff --git a/doc/syntax_option_type.html b/doc/syntax_option_type.html new file mode 100644 index 00000000..532d6386 --- /dev/null +++ b/doc/syntax_option_type.html @@ -0,0 +1,332 @@ + + + + Boost.Regex: syntax_option_type + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

syntax_option_type

+
+


+
+

+
+

Synopsis

+

Type syntax_option_type is an implementation defined bitmask type that controls how a regular expression string is to be interpreted. For convenience, all the constants listed here are also duplicated within the scope of class template basic_regex.

+
namespace std{ namespace regex_constants{
+
+typedef bitmask_type syntax_option_type;
+// these flags are standardized:
+static const syntax_option_type normal;
+static const syntax_option_type icase;
+static const syntax_option_type nosubs;
+static const syntax_option_type optimize;
+static const syntax_option_type collate;
+static const syntax_option_type ECMAScript = normal;
+static const syntax_option_type JavaScript = normal;
+static const syntax_option_type JScript = normal;
+static const syntax_option_type basic;
+static const syntax_option_type extended;
+static const syntax_option_type awk;
+static const syntax_option_type grep;
+static const syntax_option_type egrep;
+static const syntax_option_type sed = basic;
+static const syntax_option_type perl;
// these are boost.regex specific:
static const syntax_option_type escape_in_lists;
static const syntax_option_type char_classes;
static const syntax_option_type intervals;
static const syntax_option_type limited_ops;
static const syntax_option_type newline_alt;
static const syntax_option_type bk_plus_qm;
static const syntax_option_type bk_braces;
static const syntax_option_type bk_parens;
static const syntax_option_type bk_refs;
static const syntax_option_type bk_vbar;
static const syntax_option_type use_except;
static const syntax_option_type failbit;
static const syntax_option_type literal;
static const syntax_option_type nocollate;
static const syntax_option_type perlex;
static const syntax_option_type emacs;
+} // namespace regex_constants +} // namespace std
+

Description

+

The type syntax_option_type is an implementation defined bitmask type (17.3.2.1.2). Setting its elements has the effects listed in the table below. A valid value of type syntax_option_type will always have exactly one of the elements normal, basic, extended, awk, grep, egrep, sed or perl set.

+

Note that for convenience all the constants listed here are duplicated within + the scope of class template basic_regex, so you can use any of:

+
boost::regex_constants::constant_name
+

or

+
boost::regex::constant_name
+

or

+
boost::wregex::constant_name
+

in an interchangeable manner.
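The interchangeable spellings can be sketched as follows. This assumes the standard library's std::regex_constants, std::regex and std::wregex, which duplicate the constants in the same way; the boost:: spellings above are expected to behave analogously. The helper names icase_spellings_agree and matches_ignoring_case are hypothetical.

```cpp
#include <regex>
#include <string>

// The same flag is reachable through the namespace and through the
// regex typedefs; all spellings compare equal.
bool icase_spellings_agree() {
    return std::regex::icase == std::regex_constants::icase
        && std::wregex::icase == std::regex_constants::icase;
}

// Compiles `pattern` with the default grammar plus icase and tests
// whether it matches the whole of `text`.
bool matches_ignoring_case(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(text, re);
}
```

Since the type is a bitmask, one grammar element is combined with modifiers such as icase using operator|.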

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Element

+
+

Effect if set

+
+

normal

+
+

Specifies that the grammar recognized by the regular expression engine uses its + normal semantics: that is the same as that given in the ECMA-262, ECMAScript + Language Specification, Chapter 15 part 10, RegExp (Regular Expression) Objects + (FWD.1).

+

boost.regex also recognizes most perl-compatible extensions in this mode.

+
+

icase

+
+

Specifies that matching of regular expressions against a character container + sequence shall be performed without regard to case.

+
+

nosubs

+
+

Specifies that when a regular expression is matched against a character + container sequence, then no sub-expression matches are to be stored in the + supplied match_results structure.

+
+

optimize

+
+

Specifies that the regular expression engine should pay more attention to the + speed with which regular expressions are matched, and less to the speed with + which regular expression objects are constructed. Otherwise it has no + detectable effect on the program output.  This currently has no effect for + boost.regex.

+
+

collate

+
+

Specifies that character ranges of the form "[a-b]" should be locale sensitive.

+
+

ECMAScript

+
+

The same as normal.

+
+

JavaScript

+
+

The same as normal.

+
+

JScript

+
+

The same as normal.

+
+

basic

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX basic regular expressions in IEEE Std 1003.1-2001, + Portable Operating System Interface (POSIX ), Base Definitions and Headers, + Section 9, Regular Expressions (FWD.1). +

+
+

extended

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX extended regular expressions in IEEE Std + 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and + Headers, Section 9, Regular Expressions (FWD.1).

+
+

awk

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX utility awk in IEEE Std 1003.1-2001, Portable + Operating System Interface (POSIX ), Shells and Utilities, Section 4, awk + (FWD.1).

+

That is to say: the same as POSIX extended syntax, but with escape sequences in + character classes permitted.

+
+

grep

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX utility grep in IEEE Std 1003.1-2001, Portable + Operating System Interface (POSIX ), Shells and Utilities, Section 4, + Utilities, grep (FWD.1).

+

That is to say, the same as POSIX basic syntax, but with the newline character + acting as an alternation character in addition to "|".

+
+

egrep

+
+

Specifies that the grammar recognized by the regular expression engine is the + same as that used by POSIX utility grep when given the -E option in IEEE Std + 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and + Utilities, Section 4, Utilities, grep (FWD.1).

+

That is to say, the same as POSIX extended syntax, but with the newline + character acting as an alternation character in addition to "|".

+
+

sed

+
+

The same as basic.

+
+

perl

+
+

The same as normal.

+
+

+

The following constants are specific to this particular regular expression + implementation and do not appear in the + regular expression standardization proposal:

+

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
regbase::escape_in_lists | Allows the use of the escape "\" character in sets of characters, for example [\]] represents the set of characters containing only "]". If this flag is not set then "\" is an ordinary character inside sets.
regbase::char_classes | When this bit is set, character classes [:classname:] are allowed inside character set declarations, for example "[[:word:]]" represents the set of all characters that belong to the character class "word".
regbase::intervals | When this bit is set, repetition intervals are allowed, for example "a{2,4}" represents a repeat of between 2 and 4 letter a's.
regbase::limited_ops | When this bit is set all of "+", "?" and "|" are ordinary characters in all situations.
regbase::newline_alt | When this bit is set, then the newline character "\n" has the same effect as the alternation operator "|".
regbase::bk_plus_qm | When this bit is set then "\+" represents the one or more repetition operator and "\?" represents the zero or one repetition operator. When this bit is not set then "+" and "?" are used instead.
regbase::bk_braces | When this bit is set then "\{" and "\}" are used for bounded repetitions and "{" and "}" are normal characters. This is the opposite of default behavior.
regbase::bk_parens | When this bit is set then "\(" and "\)" are used to group sub-expressions and "(" and ")" are ordinary characters. This is the opposite of default behavior.
regbase::bk_refs | When this bit is set then back references are allowed.
regbase::bk_vbar | When this bit is set then "\|" represents the alternation operator and "|" is an ordinary character. This is the opposite of default behavior.
regbase::use_except | When this bit is set then a bad_expression exception will be thrown on error. Use of this flag is deprecated: basic_regex will always throw on error.
regbase::failbit | This bit is set on error. If regbase::use_except is not set, then this bit should be checked to see if a regular expression is valid before usage.
regbase::literal | All characters in the string are treated as literals, there are no special characters or escape sequences.
regbase::emacs | Provides compatibility with the emacs editor, equivalent to: bk_braces | bk_parens | bk_refs | bk_vbar.
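
The grouping convention described for bk_parens is the one POSIX basic syntax uses, so it can be observed by switching grammars. The sketch below is an illustration only: it uses std::regex as a stand-in (the regbase:: flags themselves are specific to Boost.Regex and are not used here), and the helper name matches_with is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: whole-string match of `text` against `pattern`,
// compiled with the grammar flag passed in `grammar`.
// std::regex stands in for Boost.Regex in this sketch.
inline bool matches_with(const std::string& text,
                         const std::string& pattern,
                         std::regex::flag_type grammar)
{
    return std::regex_match(text, std::regex(pattern, grammar));
}
```

Under these assumptions, matches_with("ababc", "\\(ab\\)*c", std::regex::basic) groups with backslashed parentheses, matches_with("(ab)", "(ab)", std::regex::basic) treats the plain parentheses as ordinary characters, and matches_with("ababc", "(ab)*c", std::regex::extended) groups with plain parentheses.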
+

+
+

Revised + + 17 May 2003 +

+

© Copyright John Maddock 1998- 2003

+

Permission to use, copy, modify, distribute and sell this software + and its documentation for any purpose is hereby granted without fee, provided + that the above copyright notice appear in all copies and that both that + copyright notice and this permission notice appear in supporting documentation. + Dr John Maddock makes no representations about the suitability of this software + for any purpose. It is provided "as is" without express or implied warranty.

+ + + + diff --git a/doc/thread_safety.html b/doc/thread_safety.html new file mode 100644 index 00000000..eeda681d --- /dev/null +++ b/doc/thread_safety.html @@ -0,0 +1,68 @@ + + + + Boost.Regex: Thread Safety + + + + +

+ + + + + + +
+

C++ Boost

+
+

Boost.Regex

+

Thread Safety

+
+

Boost.Regex Index

+
+

+
+

Class basic_regex<> and its typedefs regex and wregex are thread safe, in that compiled regular expressions can safely be shared between threads. The matching algorithms regex_match, regex_search, regex_grep, regex_format and regex_merge are all re-entrant and thread safe. Class match_results is now thread safe, in that the results of a match can be safely copied from one thread to another (for example one thread may find matches and push match_results instances onto a queue, while another thread pops them off the other end); otherwise use a separate instance of match_results per thread.

+

The POSIX API functions are all re-entrant and + thread safe, regular expressions compiled with regcomp can also be + shared between threads. +

+

The class RegEx is only thread safe if each thread + gets its own RegEx instance (apartment threading) - this is a consequence of + RegEx handling both compiling and matching regular expressions. +

+

Finally note that changing the global locale invalidates all compiled regular + expressions, therefore calling set_locale from one thread while another + uses regular expressions will produce unpredictable results. +

+

+ There is also a requirement that there is only one thread executing prior to + the start of main().

+
+

Revised + + 17 May 2003 + +

+

© Copyright John Maddock 1998- 2003

+

Permission to use, copy, modify, distribute and sell this software + and its documentation for any purpose is hereby granted without fee, provided + that the above copyright notice appear in all copies and that both that + copyright notice and this permission notice appear in supporting documentation. + Dr John Maddock makes no representations about the suitability of this software + for any purpose. It is provided "as is" without express or implied warranty.

+ + + diff --git a/doc/uarrow.gif b/doc/uarrow.gif new file mode 100644 index 00000000..6afd20c3 Binary files /dev/null and b/doc/uarrow.gif differ diff --git a/doc/vc71-performance.html b/doc/vc71-performance.html new file mode 100644 index 00000000..2478065d --- /dev/null +++ b/doc/vc71-performance.html @@ -0,0 +1,705 @@ + + + + Regular Expression Performance Comparison (Visual Studio.NET 2003) + + + + + + + +

Regular Expression Performance Comparison

+

The following tables provide comparisons between the following regular + expression libraries:

+

GRETA.

+

The Boost regex library.

+

Henry Spencer's regular expression library + - this is provided for comparison as a typical non-backtracking implementation.

+

Philip Hazel's PCRE library.

+

Details

+

Machine: Intel Pentium 4 2.8GHz PC.

+

Compiler: Microsoft Visual C++ version 7.1.

+

C++ Standard Library: Dinkumware standard library version 313.

+

OS: Win32.

+

Boost version: 1.31.0.

+

PCRE version: 3.9.

+

As ever, care should be taken in interpreting the results. Only sensible regular expressions (rather than pathological cases) are given; most are taken from the Boost regex examples, or from the Library of Regular Expressions. In addition, some variation in the relative performance of these libraries can be expected on other machines, as memory access and processor caching effects can be quite large for most finite state machine algorithms. In each case the first figure given is the relative time taken (so a value of 1.0 is as good as it gets), while the second figure is the actual time taken in seconds.
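
The relative figures appear to be each library's time divided by the fastest time recorded for the same test. That reading is an inference, checked against the "Twain" row of Comparison 1 (0.541s against a best time of 0.0275s gives roughly 19.7). A minimal sketch of the arithmetic, with a hypothetical helper name:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: converts the absolute times for one test (seconds,
// one entry per library) into relative scores, where the fastest library
// scores exactly 1.0. This is an assumed reconstruction of how the tables
// were produced, not code taken from the benchmark program.
inline std::vector<double> relative_times(const std::vector<double>& seconds)
{
    std::vector<double> result;
    if (seconds.empty())
        return result;

    // The best (smallest) time in the row is the reference point.
    const double best = *std::min_element(seconds.begin(), seconds.end());

    result.reserve(seconds.size());
    for (double t : seconds)
        result.push_back(t / best);
    return result;
}
```

Feeding in the six "Twain" timings from Comparison 1 (0.541, 2.35, 0.0851, 0.0851, 3.6, 0.0275) gives scores that round to the published relative figures for that row.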

+

Averages

+

The following are the average relative scores for all the tests: the perfect + regular expression library would score 1, in practice anything less than 2 + is pretty good.

+ + + + + + + + + + + + + + + + + +
GRETA | GRETA (non-recursive mode) | Boost | Boost + C++ locale | POSIX | PCRE
6.90669 | 23.751 | 1.62553 | 1.38213 | 110.973 | 1.69371
+
+
+

Comparison 1: Long Search

+

For each of the following regular expressions the time taken to find all + occurrences of the expression within a long English language text was measured + (mtent12.txt + from Project Gutenberg, 19Mb). 

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Expression | GRETA | GRETA (non-recursive mode) | Boost | Boost + C++ locale | POSIX | PCRE
Twain | 19.7 (0.541s) | 85.5 (2.35s) | 3.09 (0.0851s) | 3.09 (0.0851s) | 131 (3.6s) | 1 (0.0275s)
Huck[[:alpha:]]+ | 11 (0.55s) | 93.4 (4.68s) | 3.4 (0.17s) | 3.35 (0.168s) | 124 (6.19s) | 1 (0.0501s)
[[:alpha:]]+ing | 11.3 (6.82s) | 21.3 (12.8s) | 1.83 (1.1s) | 1 (0.601s) | 6.47 (3.89s) | 4.75 (2.85s)
^[^ ]*?Twain | 5.75 (1.15s) | 17.1 (3.43s) | 1 (0.2s) | 1.3 (0.26s) | NA | 3.8 (0.761s)
Tom|Sawyer|Huckleberry|Finn | 28.5 (3.1s) | 77.2 (8.4s) | 2.3 (0.251s) | 1 (0.109s) | 191 (20.8s) | 1.77 (0.193s)
(Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn) | 16.2 (4.14s) | 49 (12.5s) | 1.65 (0.42s) | 1 (0.255s) | NA | 2.43 (0.62s)
+
+
+

Comparison 2: Medium Sized Search

+

For each of the following regular expressions the time taken to find all + occurrences of the expression within a medium sized English language text was + measured (the first 50K from mtent12.txt). 

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Expression | GRETA | GRETA (non-recursive mode) | Boost | Boost + C++ locale | POSIX | PCRE
Twain | 9.49 (0.00274s) | 40.7 (0.0117s) | 1.54 (0.000445s) | 1.56 (0.00045s) | 13.5 (0.00391s) | 1 (0.000289s)
Huck[[:alpha:]]+ | 14.3 (0.0027s) | 62.3 (0.0117s) | 2.26 (0.000425s) | 2.29 (0.000431s) | 1.27 (0.000239s) | 1 (0.000188s)
[[:alpha:]]+ing | 7.34 (0.0178s) | 13.7 (0.0331s) | 1 (0.00243s) | 1.02 (0.00246s) | 7.36 (0.0178s) | 5.87 (0.0142s)
^[^ ]*?Twain | 8.34 (0.00579s) | 24.8 (0.0172s) | 1.52 (0.00105s) | 1 (0.000694s) | NA | 2.81 (0.00195s)
Tom|Sawyer|Huckleberry|Finn | 12.9 (0.00781s) | 35.1 (0.0213s) | 1.67 (0.00102s) | 1 (0.000606s) | 81.5 (0.0494s) | 1.94 (0.00117s)
(Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn) | 15.6 (0.0106s) | 46.6 (0.0319s) | 2.72 (0.00186s) | 1 (0.000684s) | 311 (0.213s) | 1.72 (0.00117s)
+
+
+

Comparison 3: C++ Code Search

+

For each of the following regular expressions the time taken to find all + occurrences of the expression within the C++ source file + boost/crc.hpp was measured. 

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Expression | GRETA | GRETA (non-recursive mode) | Boost | Boost + C++ locale | POSIX | PCRE
^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\<\w+\>([ ]*\([^)]*\))?[[:space:]]*)*(\<\w*\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\{|:[^;\{()]*\{) | 8.88 (0.000792s) | 46.4 (0.00414s) | 1.19 (0.000106s) | 1 (8.92e-005s) | 688 (0.0614s) | 3.23 (0.000288s)
(^[ ]*#(?:[^\\\n]|\\[^\n_[:punct:][:alnum:]]*[\n[:punct:][:word:]])*)|(//[^\n]*|/\*.*?\*/)|\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\.)?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\>|('(?:[^\\']|\\.)*'|"(?:[^\\"]|\\.)*")|\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned|using|virtual|void|volatile|wchar_t|while)\> | 1 (0.00571s) | 5.31 (0.0303s) | 2.47 (0.0141s) | 1.92 (0.011s) | NA | 3.29 (0.0188s)
^[ ]*#[ ]*include[ ]+("[^"]+"|<[^>]+>) | 5.78 (0.00172s) | 26.3 (0.00783s) | 1.12 (0.000333s) | 1 (0.000298s) | 128 (0.0382s) | 1.74 (0.000518s)
^[ ]*#[ ]*include[ ]+("boost/[^"]+"|<boost/[^>]+>) | 10.2 (0.00305s) | 28.4 (0.00845s) | 1.12 (0.000333s) | 1 (0.000298s) | 155 (0.0463s) | 1.74 (0.000519s)
+
+

+

Comparison 4: HTML Document Search +

+

For each of the following regular expressions the time taken to find all + occurrences of the expression within the html file libs/libraries.htm + was measured. 

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Expression | GRETA | GRETA (non-recursive mode) | Boost | Boost + C++ locale | POSIX | PCRE
beman|john|dave | 11 (0.00297s) | 34.3 (0.00922s) | 1.78 (0.000479s) | 1 (0.000269s) | 55.2 (0.0149s) | 1.85 (0.000499s)
<p>.*?</p> | 5.38 (0.00145s) | 21.8 (0.00587s) | 1.02 (0.000274s) | 1 (0.000269s) | NA | 1.05 (0.000283s)
<a[^>]+href=("[^"]*"|[^[:space:]]+)[^>]*> | 4.51 (0.00207s) | 12.6 (0.00579s) | 1.34 (0.000616s) | 1 (0.000459s) | 343 (0.158s) | 1.09 (0.000499s)
<h[12345678][^>]*>.*?</h[12345678]> | 7.39 (0.00143s) | 29.6 (0.00571s) | 1.87 (0.000362s) | 1 (0.000193s) | NA | 1.27 (0.000245s)
<img[^>]+src=("[^"]*"|[^[:space:]]+)[^>]*> | 6.73 (0.00145s) | 27.3 (0.00587s) | 1.2 (0.000259s) | 1.32 (0.000283s) | 148 (0.0319s) | 1 (0.000215s)
<font[^>]+face=("[^"]*"|[^[:space:]]+)[^>]*>.*?</font> | 6.93 (0.00153s) | 27 (0.00595s) | 1.22 (0.000269s) | 1.31 (0.000289s) | NA | 1 (0.00022s)
+
+
+

Comparison 5: Simple Matches

+

For each of the following regular expressions the time taken to match against + the text indicated was measured. 

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Expression | Text | GRETA | GRETA (non-recursive mode) | Boost | Boost + C++ locale | POSIX | PCRE
abc | abc | 1.31 (2.2e-007s) | 1.94 (3.25e-007s) | 1.26 (2.1e-007s) | 1.24 (2.08e-007s) | 3.03 (5.06e-007s) | 1 (1.67e-007s)
^([0-9]+)(\-| |$)(.*)$ | 100- this is a line of ftp response which contains a message string | 1.52 (6.88e-007s) | 2.28 (1.03e-006s) | 1.5 (6.78e-007s) | 1.5 (6.78e-007s) | 329 (0.000149s) | 1 (4.53e-007s)
([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4} | 1234-5678-1234-456 | 2.04 (1.03e-006s) | 2.83 (1.43e-006s) | 2.12 (1.07e-006s) | 2.04 (1.03e-006s) | 30.8 (1.56e-005s) | 1 (5.05e-007s)
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ | john_maddock@compuserve.com | 1.48 (1.78e-006s) | 2.1 (2.52e-006s) | 1.35 (1.62e-006s) | 1.32 (1.59e-006s) | 165 (0.000198s) | 1 (1.2e-006s)
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ | foo12@foo.edu | 1.28 (1.41e-006s) | 1.9 (2.1e-006s) | 1.42 (1.57e-006s) | 1.38 (1.53e-006s) | 107 (0.000119s) | 1 (1.11e-006s)
^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$ | bob.smith@foo.tv | 1.29 (1.43e-006s) | 1.9 (2.1e-006s) | 1.42 (1.57e-006s) | 1.38 (1.53e-006s) | 119 (0.000132s) | 1 (1.11e-006s)
^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ | EH10 2QQ | 1.26 (4.63e-007s) | 1.77 (6.49e-007s) | 1.3 (4.77e-007s) | 1.2 (4.4e-007s) | 9.15 (3.36e-006s) | 1 (3.68e-007s)
^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ | G1 1AA | 1.06 (4.73e-007s) | 1.59 (7.07e-007s) | 1.05 (4.68e-007s) | 1 (4.44e-007s) | 12.9 (5.73e-006s) | 1.63 (7.26e-007s)
^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$ | SW1 1ZZ | 1.26 (9.17e-007s) | 1.84 (1.34e-006s) | 1.28 (9.26e-007s) | 1.21 (8.78e-007s) | 8.42 (6.11e-006s) | 1 (7.26e-007s)
^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ | 4/1/2001 | 1.57 (9.73e-007s) | 2.28 (1.41e-006s) | 1.25 (7.73e-007s) | 1.26 (7.83e-007s) | 11.2 (6.95e-006s) | 1 (6.21e-007s)
^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$ | 12/12/2001 | 1.52 (9.56e-007s) | 2.06 (1.3e-006s) | 1.29 (8.12e-007s) | 1.24 (7.83e-007s) | 12.4 (7.8e-006s) | 1 (6.3e-007s)
^[-+]?[[:digit:]]*\.?[[:digit:]]*$ | 123 | 2.11 (7.35e-007s) | 3.18 (1.11e-006s) | 2.5 (8.7e-007s) | 2.44 (8.5e-007s) | 5.26 (1.83e-006s) | 1 (3.49e-007s)
^[-+]?[[:digit:]]*\.?[[:digit:]]*$ | +3.14159 | 1.31 (4.96e-007s) | 1.92 (7.26e-007s) | 1.26 (4.77e-007s) | 1.2 (4.53e-007s) | 9.71 (3.66e-006s) | 1 (3.77e-007s)
^[-+]?[[:digit:]]*\.?[[:digit:]]*$ | -3.14159 | 1.32 (4.97e-007s) | 1.92 (7.26e-007s) | 1.24 (4.67e-007s) | 1.2 (4.53e-007s) | 9.7 (3.66e-006s) | 1 (3.78e-007s)
+
+
+
+

Copyright John Maddock April 2003, all rights reserved.

+ + diff --git a/example/snippets/regex_iterator_example.cpp b/example/snippets/regex_iterator_example.cpp new file mode 100644 index 00000000..6ec3d85e --- /dev/null +++ b/example/snippets/regex_iterator_example.cpp @@ -0,0 +1,115 @@ +/* + * + * Copyright (c) 2003 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + + /* + * LOCATION: see http://www.boost.org for most recent version. + * FILE regex_iterator_example_2.cpp + * VERSION see + * DESCRIPTION: regex_iterator example 2: searches a cpp file for class definitions, + * using global data. + */ + +#include +#include +#include +#include +#include + +using namespace std; + +// purpose: +// takes the contents of a file in the form of a string +// and searches for all the C++ class definitions, storing +// their locations in a map of strings/int's + +typedef std::map > map_type; + +const char* re = + // possibly leading whitespace: + "^[[:space:]]*" + // possible template declaration: + "(template[[:space:]]*<[^;:{]+>[[:space:]]*)?" + // class or struct: + "(class|struct)[[:space:]]*" + // leading declspec macros etc: + "(" + "\\<\\w+\\>" + "(" + "[[:blank:]]*\\([^)]*\\)" + ")?" + "[[:space:]]*" + ")*" + // the class name + "(\\<\\w*\\>)[[:space:]]*" + // template specialisation parameters + "(<[^;:{]+>)?[[:space:]]*" + // terminate in { or : + "(\\{|:[^;\\{()]*\\{)"; + + +boost::regex expression(re); +map_type class_index; + +bool regex_callback(const boost::match_results& what) +{ + // what[0] contains the whole string + // what[5] contains the class name. 
+ // what[6] contains the template specialisation if any. + // add class name and position to map: + class_index[what[5].str() + what[6].str()] = what.position(5); + return true; +} + +void load_file(std::string& s, std::istream& is) +{ + s.erase(); + s.reserve(is.rdbuf()->in_avail()); + char c; + while(is.get(c)) + { + if(s.capacity() == s.size()) + s.reserve(s.capacity() * 3); + s.append(1, c); + } +} + +int main(int argc, const char** argv) +{ + std::string text; + for(int i = 1; i < argc; ++i) + { + cout << "Processing file " << argv[i] << endl; + std::ifstream fs(argv[i]); + load_file(text, fs); + // construct our iterators: + boost::regex_iterator m1(text.begin(), text.end(), expression); + boost::regex_iterator m2; + std::for_each(m1, m2, ®ex_callback); + // copy results: + cout << class_index.size() << " matches found" << endl; + map_type::iterator c, d; + c = class_index.begin(); + d = class_index.end(); + while(c != d) + { + cout << "class \"" << (*c).first << "\" found at index: " << (*c).second << endl; + ++c; + } + class_index.erase(class_index.begin(), class_index.end()); + } + return 0; +} + + diff --git a/example/snippets/regex_replace_example.cpp b/example/snippets/regex_replace_example.cpp new file mode 100644 index 00000000..b00345ff --- /dev/null +++ b/example/snippets/regex_replace_example.cpp @@ -0,0 +1,138 @@ +/* + * + * Copyright (c) 1998-2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + + /* + * LOCATION: see http://www.boost.org for most recent version. 
+ * FILE regex_replace_example.cpp + * VERSION see + * DESCRIPTION: regex_replace example: + * converts a C++ file to syntax highlighted HTML. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +// purpose: +// takes the contents of a file and transform to +// syntax highlighted code in html format + +boost::regex e1, e2; +extern const char* expression_text; +extern const char* format_string; +extern const char* pre_expression; +extern const char* pre_format; +extern const char* header_text; +extern const char* footer_text; + +void load_file(std::string& s, std::istream& is) +{ + s.erase(); + s.reserve(is.rdbuf()->in_avail()); + char c; + while(is.get(c)) + { + if(s.capacity() == s.size()) + s.reserve(s.capacity() * 3); + s.append(1, c); + } +} + +int main(int argc, const char** argv) +{ + try{ + e1.assign(expression_text); + e2.assign(pre_expression); + for(int i = 1; i < argc; ++i) + { + std::cout << "Processing file " << argv[i] << std::endl; + std::ifstream fs(argv[i]); + std::string in; + load_file(in, fs); + std::string out_name = std::string(argv[i]) + std::string(".htm"); + std::ofstream os(out_name.c_str()); + os << header_text; + // strip '<' and '>' first by outputting to a + // temporary string stream + std::ostringstream t(std::ios::out | std::ios::binary); + std::ostream_iterator oi(t); + boost::regex_replace(oi, in.begin(), in.end(), e2, pre_format, boost::match_default | boost::format_all); + // then output to final output stream + // adding syntax highlighting: + std::string s(t.str()); + std::ostream_iterator out(os); + boost::regex_replace(out, s.begin(), s.end(), e1, format_string, boost::match_default | boost::format_all); + os << footer_text; + } + } + catch(...) 
+ { return -1; } + return 0; +} + +extern const char* pre_expression = "(<)|(>)|\\r"; +extern const char* pre_format = "(?1<)(?2>)"; + + +const char* expression_text = // preprocessor directives: index 1 + "(^[[:blank:]]*#(?:[^\\\\\\n]|\\\\[^\\n[:punct:][:word:]]*[\\n[:punct:][:word:]])*)|" + // comment: index 2 + "(//[^\\n]*|/\\*.*?\\*/)|" + // literals: index 3 + "\\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\>|" + // string literals: index 4 + "('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|" + // keywords: index 5 + "\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import" + "|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall" + "|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool" + "|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete" + "|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto" + "|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected" + "|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast" + "|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned" + "|using|virtual|void|volatile|wchar_t|while)\\>" + ; + +const char* format_string = "(?1$&)" + "(?2$&)" + "(?3$&)" + "(?4$&)" + "(?5$&)"; + +const char* header_text = "\n\n" + "Auto-generated html formated source\n" + "\n" + "\n" + "\n" + "

\n
";
+
+const char* footer_text = "
\n\n\n"; + + + + + + + + + + + diff --git a/example/snippets/regex_token_iterator_example_1.cpp b/example/snippets/regex_token_iterator_example_1.cpp new file mode 100644 index 00000000..8ba8dcb5 --- /dev/null +++ b/example/snippets/regex_token_iterator_example_1.cpp @@ -0,0 +1,75 @@ +/* + * + * Copyright (c) 12003 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + + /* + * LOCATION: see http://www.boost.org for most recent version. + * FILE regex_token_iterator_example_1.cpp + * VERSION see + * DESCRIPTION: regex_token_iterator example: split a string into tokens. + */ + + +#include + +#include +using namespace std; + + +#if defined(BOOST_MSVC) || (defined(__BORLANDC__) && (__BORLANDC__ == 0x550)) +// +// problem with std::getline under MSVC6sp3 +istream& getline(istream& is, std::string& s) +{ + s.erase(); + char c = is.get(); + while(c != '\n') + { + s.append(1, c); + c = is.get(); + } + return is; +} +#endif + + +int main(int argc) +{ + string s; + do{ + if(argc == 1) + { + cout << "Enter text to split (or \"quit\" to exit): "; + getline(cin, s); + if(s == "quit") break; + } + else + s = "This is a string of tokens"; + + boost::regex re("\\s+"); + boost::regex_token_iterator i(s.begin(), s.end(), re, -1); + boost::regex_token_iterator j; + + unsigned count = 0; + while(i != j) + { + cout << *i++ << endl; + count++; + } + cout << "There were " << count << " tokens found." 
<< endl; + + }while(argc == 1); + return 0; +} + diff --git a/example/snippets/regex_token_iterator_example_2.cpp b/example/snippets/regex_token_iterator_example_2.cpp new file mode 100644 index 00000000..71b2188b --- /dev/null +++ b/example/snippets/regex_token_iterator_example_2.cpp @@ -0,0 +1,92 @@ +/* + * + * Copyright (c) 2003 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + + /* + * LOCATION: see http://www.boost.org for most recent version. + * FILE regex_token_iterator_example_2.cpp + * VERSION see + * DESCRIPTION: regex_token_iterator example: spit out linked URL's. + */ + + +#include +#include +#include +#include + +boost::regex e("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", + boost::regex::normal | boost::regbase::icase); + +void load_file(std::string& s, std::istream& is) +{ + s.erase(); + // + // attempt to grow string buffer to match file size, + // this doesn't always work... 
+ s.reserve(is.rdbuf()->in_avail()); + char c; + while(is.get(c)) + { + // use logarithmic growth stategy, in case + // in_avail (above) returned zero: + if(s.capacity() == s.size()) + s.reserve(s.capacity() * 3); + s.append(1, c); + } +} + +int main(int argc, char** argv) +{ + std::string s; + int i; + for(i = 1; i < argc; ++i) + { + std::cout << "Findings URL's in " << argv[i] << ":" << std::endl; + s.erase(); + std::ifstream is(argv[i]); + load_file(s, is); + boost::regex_token_iterator + i(s.begin(), s.end(), e, 1); + boost::regex_token_iterator j; + while(i != j) + { + std::cout << *i++ << std::endl; + } + } + // + // alternative method: + // test the array-literal constructor, and split out the whole + // match as well as $1.... + // + for(i = 1; i < argc; ++i) + { + std::cout << "Findings URL's in " << argv[i] << ":" << std::endl; + s.erase(); + std::ifstream is(argv[i]); + load_file(s, is); + const int subs[] = {1, 0,}; + boost::regex_token_iterator + i(s.begin(), s.end(), e, subs); + boost::regex_token_iterator j; + while(i != j) + { + std::cout << *i++ << std::endl; + } + } + + return 0; +} + + diff --git a/faq.htm b/faq.htm deleted file mode 100644 index fb3795b6..00000000 --- a/faq.htm +++ /dev/null @@ -1,205 +0,0 @@ - - - - - - -Regex++ - FAQ - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, FAQ.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -

Q. Why does using parentheses in a regular expression change the result of a match?

- -

A. Parentheses don't only mark; they determine what the best match is as well. regex++ tries to follow the POSIX standard leftmost-longest rule for determining what matched. So if there is more than one possible match after considering the whole expression, it looks next at the first sub-expression, then the second sub-expression, and so on. So...

- -
"(0*)([0-9]*)" against "00123" would produce
-$1 = "00"
-$2 = "123"
- -

whereas

- 
"0*([0-9]*)" against "00123" would produce
-$1 = "00123"
- -

If you think about it, had $1 only matched the "123", -this would be "less good" than the match "00123" -which is both further to the left and longer. If you want $1 to -match only the "123" part, then you need to use -something like:

- -
"0*([1-9][0-9]*)"
- -

as the expression.
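
The first and last expressions above can be checked with a few lines of code. This is a hedged sketch only: it uses std::regex with its default ECMAScript grammar as a stand-in for regex++, and the helper name capture is hypothetical. The middle expression is deliberately left out, because what $1 contains there depends on whether the engine applies the POSIX leftmost-longest rule described above or Perl-style greedy matching.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: matches the whole of `text` against `pattern` and
// returns the text of sub-expression `index` ($1 is index 1), or an empty
// string if there is no match or no such sub-expression.
// std::regex (default ECMAScript grammar) stands in for regex++ here.
inline std::string capture(const std::string& text,
                           const std::string& pattern,
                           std::size_t index)
{
    const std::regex re(pattern);
    std::smatch what;
    if (!std::regex_match(text, what, re) || index >= what.size())
        return std::string();
    return what[index].str();
}
```

With this helper, capture("00123", "(0*)([0-9]*)", 1) and capture("00123", "(0*)([0-9]*)", 2) pick out the two marked parts, and capture("00123", "0*([1-9][0-9]*)", 1) shows the rewritten expression excluding the leading zeros from $1.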

- -

Q. Configure says that my compiler is -unable to merge template instances, what does this mean?

- -

A. When you compile template code, you can end up with the -same template instances in multiple translation units - this will -lead to link time errors unless your compiler/linker is smart -enough to merge these template instances into a single record in -the executable file. If you see this warning after running -configure, then you can still link to libregex++.a if:

- -
    -
  1. You use only the low-level template classes (reg_expression<> - match_results<> etc), from a single translation - unit, and use no other part of regex++.
  2. -
  3. You use only the POSIX API functions (regcomp regexec etc), - and no other part of regex++.
  4. -
  5. You use only the high level class RegEx, and no other - part of regex++.
  6. -
- -

Another option is to create a master include file, which -#include's all the regex++ source files, and all the source files -in which you use regex++. You then compile and link this master -file as a single translation unit.

- -

Q. Configure says that my compiler is -unable to merge template instances from archive files, what does -this mean?

- -

A. When you compile template code, you can end up with the -same template instances in multiple translation units - this will -lead to link time errors unless your compiler/linker is smart -enough to merge these template instances into a single record in -the executable file. Some compilers are able to do this for -normal .cpp or .o files, but fail if the object file has been -placed in a library archive. If you see this warning after -running configure, then you can still link to libregex++.a if:

- -
    -
  1. You use only the low-level template classes (reg_expression<> - match_results<> etc), and use no other part of - regex++.
  2. -
  3. You use only the POSIX API functions (regcomp regexec etc), - and no other part of regex++.
  4. -
  5. You use only the high level class RegEx, and no other - part of regex++.
  6. -
- -

Another option is to add the regex++ source files directly to -your project instead of linking to libregex++.a, generally you -should do this only if you are getting link time errors with -libregex++.a.

- -

Q. Configure says that my compiler can't -merge templates containing switch statements, what does this -mean?

- -

A. Some compilers can't merge templates that contain static -data - this includes switch statements which implicitly generate -static data as well as code. Principally this affects the egcs -compiler - but note gcc 2.81 also suffers from this problem - the -compiler will compile and link the code - but the code will not -run because the code and the static data it uses have become -separated. The default behaviour of regex++ is to try and fix -this problem by declaring "problem" templates inside -unnamed namespaces, so that the templates have internal linkage. -Note that this can result in a great deal of code bloat. If the -compiler doesn't support namespaces, or if code bloat becomes a -problem, then follow the guidelines above for placing all the -templates used in a single translation unit, and edit boost/regex/config.hpp -so that BOOST_REGEX_NO_TEMPLATE_SWITCH_MERGE is no longer defined. -

- -

Q. I can't get regex++ to work with -escape characters, what's going on?

- -

A. If you embed regular expressions in C++ code, then remember -that escape characters are processed twice: once by the C++ -compiler, and once by the regex++ expression compiler, so to pass -the regular expression \d+ to regex++, you need to embed "\\d+" -in your code. Likewise to match a literal backslash you will need -to embed "\\\\" in your code.
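The double processing of escapes can be sketched as follows. The example assumes the C++11 `std::regex` in place of regex++ (the escaping rule for string literals is the same), and the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// The C++ compiler turns the literal "\\d+" into the three characters \d+
// before the regular expression compiler ever sees them.
bool all_digits(const std::string& text)
{
   static const std::regex e("\\d+");
   return std::regex_match(text, e);
}

// The literal "\\\\" reaches the expression compiler as \\ which matches
// one literal backslash in the input.
bool contains_backslash(const std::string& text)
{
   static const std::regex e("\\\\");
   return std::regex_search(text, e);
}
```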

- -

Q. Why don't character ranges work -properly?
-A. The POSIX standard specifies that character range expressions -are locale sensitive - so for example the expression [A-Z] will -match any collating element that collates between 'A' and 'Z'. -That means that for most locales other than "C" or -"POSIX", [A-Z] would match the single character 't' for -example, which is not what most people expect - or at least not -what most people have come to expect from regular expression -engines. For this reason, the default behaviour of regex++ is to -turn locale sensitive collation off by setting the regbase::nocollate -compile time flag (this is set by regbase::normal). However if -you set a non-default compile time flag - for example regbase::extended -or regbase::basic, then locale dependent collation will be -enabled; this also applies to the POSIX API functions, which use -either regbase::extended or regbase::basic internally. In the -latter case use REG_NOCOLLATE in combination with either -REG_BASIC or REG_EXTENDED when invoking regcomp if you don't want -locale sensitive collation. [Note - when regbase::nocollate is in -effect, the library behaves "as if" the LC_COLLATE -locale category were always "C", regardless of what it's -actually set to - end note].

- -

 Q. Why can't I use the "convenience" -versions of query_match/reg_search/reg_grep/reg_format/reg_merge? -

- -

A. These versions may or may not be available depending upon -the capabilities of your compiler, the rules determining the -format of these functions are quite complex - and only the -versions visible to a standard compliant compiler are given in -the help. To find out what your compiler supports, run <boost/regex.hpp> -through your C++ pre-processor, and search the output file for -the function that you are interested in.

- -

Q. Why are there no throw specifications -on any of the functions? What exceptions can the library throw? -

- -

A. Not all compilers support (or honor) throw specifications, -others support them but with reduced efficiency. Throw -specifications may be added at a later date as compilers begin to -handle this better. The library should throw only three types of -exception: boost::bad_expression can be thrown by reg_expression -when compiling a regular expression, std::runtime_error can be -thrown when a call to reg_expression::imbue tries to open a -message catalogue that doesn't exist or when a call to RegEx::GrepFiles -or RegEx::FindFiles tries to open a file that cannot be opened, -finally std::bad_alloc can be thrown by just about any of the -functions in this library.
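Handling a rejected expression might look like the sketch below. It is written against the C++11 `std::regex`, where `std::regex_error` plays roughly the role that boost::bad_expression plays in regex++; that mapping and the name `is_valid_expression` are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` compiles, false if the expression compiler
// rejects it. std::bad_alloc is deliberately left to propagate.
bool is_valid_expression(const std::string& pattern)
{
   try
   {
      const std::regex e(pattern);
      (void)e; // only the side effect of compiling is of interest
      return true;
   }
   catch(const std::regex_error&)
   {
      return false;
   }
}
```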

- -
- -

Copyright Dr -John Maddock 1998-2000 all rights reserved.

- - diff --git a/format_string.htm b/format_string.htm deleted file mode 100644 index 41a33842..00000000 --- a/format_string.htm +++ /dev/null @@ -1,243 +0,0 @@ - - - - - - -Regex++, Format String Reference - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, Format - String Reference.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

Format String Syntax

- -

Format strings are used by the algorithms regex_format and regex_merge, and are -used to transform one string into another.

- -

There are three kinds of format string: sed, perl and extended; -the extended syntax is the default, so it is covered first.

- -

Extended format syntax

- -

In format strings, all characters are treated as literals -except: ()$\?:

- -

To use any of these as literals you must prefix them with the -escape character \

- -

The following special sequences are recognized:

- -

Grouping:

- -

Use the parenthesis characters ( and ) to group sub-expressions -within the format string, use \( and \) to represent literal '(' -and ')'.

- -

Sub-expression expansions:

- -

The following perl like expressions expand to a particular -matched sub-expression:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 $`      Expands to all the text from the end of the previous match to the start of the current match; if there was no previous match in the current operation, then everything from the start of the input string to the start of the match.
 $'      Expands to all the text from the end of the match to the end of the input string.
 $&      Expands to all of the current match.
 $0      Expands to all of the current match.
 $N      Expands to the text that matched sub-expression N.
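The perl-like expansions can be tried out with a short sketch. It assumes the C++11 `std::regex_replace`, whose default format syntax also recognises $&, $`, $' and $N; it is not the regex_merge algorithm itself, and `reformat` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Replaces every match of `expression` in `text` with the expansion of
// `format`; unmatched text is copied through unchanged.
// Assumption: std::regex_replace stands in for regex_merge.
std::string reformat(const std::string& text,
                     const std::string& expression,
                     const std::string& format)
{
   return std::regex_replace(text, std::regex(expression), format);
}
```

For example, `reformat("key=value", "(\\w+)=(\\w+)", "$2=$1")` swaps the two marked sub-expressions.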
- -


- -

Conditional expressions:

- -

Conditional expressions allow two different format strings to -be selected dependent upon whether a sub-expression participated -in the match or not:

- -

?Ntrue_expression:false_expression

- -

Executes true_expression if sub-expression N -participated in the match, otherwise executes false_expression.

- -

Example: suppose we search for "(while)|(for)" then -the format string "?1WHILE:FOR" would output what -matched, but in upper case.
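What "participated in the match" means can be shown by hand. The sketch below does not use the ?N format syntax at all; it assumes the C++11 `std::regex` and tests the `matched` flag of sub-expression 1 directly, which is the same decision the conditional format string makes. The name `keyword_upper` is hypothetical.

```cpp
#include <regex>
#include <string>

// Searches for "(while)|(for)" and returns the keyword found in upper
// case, or an empty string if neither keyword is present.
std::string keyword_upper(const std::string& text)
{
   static const std::regex e("(while)|(for)");
   std::smatch what;
   if(!std::regex_search(text, what, e))
      return std::string();
   // sub-expression 1 participated only if "while" was what matched
   return what[1].matched ? "WHILE" : "FOR";
}
```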

- -

Escape sequences:

- -

The following escape sequences are also allowed:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 \a      The bell character.
 \f      The form feed character.
 \n      The newline character.
 \r      The carriage return character.
 \t      The tab character.
 \v      A vertical tab character.
 \x      A hexadecimal character - for example \x0D.
 \x{}    A possible Unicode hexadecimal character - for example \x{1A0}
 \cx     The ASCII escape character x, for example \c@ is equivalent to escape-@.
 \e      The ASCII escape character.
 \dd     An octal character constant, for example \10.
- -


- -

Perl format strings

- -

Perl format strings are the same as the default syntax except -that the characters ()?: have no special meaning.

- -

Sed format strings

- -

Sed format strings use only the characters \ and & as -special characters.

- -

\n where n is a digit, is expanded to the nth sub-expression.

- -

& is expanded to the whole of the match (equivalent to \0). -

- -

Other escape sequences are expanded as per the default syntax. -
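A sed-style replacement can be sketched with the C++11 standard library, which offers a comparable `std::regex_constants::format_sed` flag. This is an illustration under that assumption rather than the regex++ implementation, and `sed_replace` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Replaces every match of `expression` in `text` using a sed-style format
// string, in which \n names the nth sub-expression and & the whole match.
std::string sed_replace(const std::string& text,
                        const std::string& expression,
                        const std::string& format)
{
   return std::regex_replace(text, std::regex(expression), format,
                             std::regex_constants::format_sed);
}
```

Remember that the backslashes in the format string have to be doubled in C++ source, just as they do in the expression.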
-

- -
- -

Copyright Dr -John Maddock 1998-2000 all rights reserved.

- - diff --git a/hl_ref.htm b/hl_ref.htm deleted file mode 100644 index 44b803a1..00000000 --- a/hl_ref.htm +++ /dev/null @@ -1,572 +0,0 @@ - - - - - - -Regex++, RegEx Class Reference - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, RegEx Class - Reference.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

Class RegEx

- -

#include <boost/cregex.hpp>

- -

The class RegEx provides a high level simplified interface to -the regular expression library, this class only handles narrow -character strings, and regular expressions always follow the -"normal" syntax - that is the same as the standard -POSIX extended syntax, but with locale specific collation -disabled, and escape characters inside character set declarations -are allowed.

- -
typedef bool (*GrepCallback)(const RegEx& expression);
-typedef bool (*GrepFileCallback)(const char* file, const RegEx& expression);
-typedef bool (*FindFilesCallback)(const char* file);
-
-class  RegEx
-{
-public:
-   RegEx();
-   RegEx(const RegEx& o);
-   ~RegEx();
-   RegEx(const char* c, bool icase = false);
-   explicit RegEx(const std::string& s, bool icase = false);
-   RegEx& operator=(const RegEx& o);
-   RegEx& operator=(const char* p);
-   RegEx& operator=(const std::string& s);
-   unsigned int SetExpression(const char* p, bool icase = false);
-   unsigned int SetExpression(const std::string& s, bool icase = false);
-   std::string Expression()const;
-   //
-   // now matching operators: 
-   // 
-   bool Match(const char* p, unsigned int flags = match_default);
-   bool Match(const std::string& s, unsigned int flags = match_default); 
-   bool Search(const char* p, unsigned int flags = match_default); 
-   bool Search(const std::string& s, unsigned int flags = match_default); 
-   unsigned int Grep(GrepCallback cb, const char* p, unsigned int flags = match_default); 
-   unsigned int Grep(GrepCallback cb, const std::string& s, unsigned int flags = match_default); 
-   unsigned int Grep(std::vector<std::string>& v, const char* p, unsigned int flags = match_default); 
-   unsigned int Grep(std::vector<std::string>& v, const std::string& s, unsigned int flags = match_default); 
-   unsigned int Grep(std::vector<unsigned int>& v, const char* p, unsigned int flags = match_default); 
-   unsigned int Grep(std::vector<unsigned int>& v, const std::string& s, unsigned int flags = match_default); 
-   unsigned int GrepFiles(GrepFileCallback cb, const char* files, bool recurse = false, unsigned int flags = match_default); 
-   unsigned int GrepFiles(GrepFileCallback cb, const std::string& files, bool recurse = false, unsigned int flags = match_default); 
-   unsigned int FindFiles(FindFilesCallback cb, const char* files, bool recurse = false, unsigned int flags = match_default); 
-   unsigned int FindFiles(FindFilesCallback cb, const std::string& files, bool recurse = false, unsigned int flags = match_default); 
-   std::string Merge(const std::string& in, const std::string& fmt, bool copy = true, unsigned int flags = match_default); 
-   std::string Merge(const char* in, const char* fmt, bool copy = true, unsigned int flags = match_default); 
-   unsigned Split(std::vector<std::string>& v, std::string& s, unsigned flags = match_default, unsigned max_count = ~0); 
-   // 
-   // now operators for returning what matched in more detail: 
-   // 
-   unsigned int Position(int i = 0)const; 
-   unsigned int Length(int i = 0)const; 
-   bool Matched(int i = 0)const;
-   unsigned int Line()const; 
-   unsigned int Marks() const; 
-   std::string What(int i)const; 
-   std::string operator[](int i)const ; 
-
-   static const unsigned int npos;
-};     
- -

Member functions for class RegEx are defined as follows:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 RegEx();Default constructor, - constructs an instance of RegEx without any valid - expression. 
 RegEx(const - RegEx& o);Copy constructor, all the - properties of parameter o are copied. 
 RegEx(const char* - c, bool icase = false);Constructs an instance of - RegEx, setting the expression to c, if icase - is true then matching is insensitive to case, - otherwise it is sensitive to case. Throws bad_expression - on failure. 
 RegEx(const std::string& - s, bool icase = false);Constructs an instance of - RegEx, setting the expression to s, if icase is - true then matching is insensitive to case, - otherwise it is sensitive to case. Throws bad_expression - on failure. 
 RegEx& operator=(const - RegEx& o);Default assignment operator. 
 RegEx& operator=(const - char* p);Assignment operator, - equivalent to calling SetExpression(p, false). - Throws bad_expression on failure. 
 RegEx& operator=(const - std::string& s);Assignment operator, - equivalent to calling SetExpression(s, false). - Throws bad_expression on failure. 
 unsigned int - SetExpression(const char* p, bool icase = false);Sets the current expression - to p, if icase is true then matching - is insensitive to case, otherwise it is sensitive to case. - Throws bad_expression on failure. 
 unsigned int - SetExpression(const std::string& s, bool - icase = false);Sets the current expression - to s, if icase is true then matching - is insensitive to case, otherwise it is sensitive to case. - Throws bad_expression on failure. 
 std::string Expression()const;Returns a copy of the - current regular expression. 
 bool Match(const - char* p, unsigned int flags = - match_default);Attempts to match the - current expression against the text p using the - match flags flags - see match flags. - Returns true if the expression matches the whole - of the input string. 
 bool Match(const - std::string& s, unsigned int flags = - match_default) ;Attempts to match the - current expression against the text s using the - match flags flags - see match flags. - Returns true if the expression matches the whole - of the input string. 
 bool Search(const - char* p, unsigned int flags = - match_default);Attempts to find a match for - the current expression somewhere in the text p - using the match flags flags - see match flags. - Returns true if the match succeeds. 
 bool Search(const - std::string& s, unsigned int flags = - match_default) ;Attempts to find a match for - the current expression somewhere in the text s - using the match flags flags - see match flags. - Returns true if the match succeeds. 
 unsigned int - Grep(GrepCallback cb, const char* p, unsigned - int flags = match_default);Finds all matches of the - current expression in the text p using the match - flags flags - see match flags. - For each match found calls the call-back function cb - as: cb(*this);

If at any stage the call-back function - returns false then the grep operation terminates, - otherwise continues until no further matches are found. - Returns the number of matches found.

-
 
 unsigned int - Grep(GrepCallback cb, const std::string& s, unsigned - int flags = match_default);Finds all matches of the - current expression in the text s using the match - flags flags - see match flags. - For each match found calls the call-back function cb - as: cb(*this);

If at any stage the call-back function - returns false then the grep operation terminates, - otherwise continues until no further matches are found. - Returns the number of matches found.

-
 
 unsigned int - Grep(std::vector<std::string>& v, const char* - p, unsigned int flags = match_default);Finds all matches of the - current expression in the text p using the match - flags flags - see match flags. - For each match pushes a copy of what matched onto v. - Returns the number of matches found. 
 unsigned int - Grep(std::vector<std::string>& v, const - std::string& s, unsigned int flags = - match_default);Finds all matches of the - current expression in the text s using the match - flags flags - see match flags. - For each match pushes a copy of what matched onto v. - Returns the number of matches found. 
 unsigned int - Grep(std::vector<unsigned int>& v, const - char* p, unsigned int flags = - match_default);Finds all matches of the - current expression in the text p using the match - flags flags - see match flags. - For each match pushes the starting index of what matched - onto v. Returns the number of matches found. 
 unsigned int - Grep(std::vector<unsigned int>& v, const - std::string& s, unsigned int flags = - match_default);Finds all matches of the - current expression in the text s using the match - flags flags - see match flags. - For each match pushes the starting index of what matched - onto v. Returns the number of matches found. 
 unsigned int - GrepFiles(GrepFileCallback cb, const char* - files, bool recurse = false, unsigned - int flags = match_default);Finds all matches of the - current expression in the files files using the - match flags flags - see match flags. - For each match calls the call-back function cb. 

If - the call-back returns false then the algorithm returns - without considering further matches in the current file, - or any further files. 

-

The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. 

-

Returns the total number of matches found.

-

May throw an exception derived from std::runtime_error - if file io fails.

-
 
 unsigned int - GrepFiles(GrepFileCallback cb, const std::string& - files, bool recurse = false, unsigned - int flags = match_default);Finds all matches of the - current expression in the files files using the - match flags flags - see match flags. - For each match calls the call-back function cb. 

If - the call-back returns false then the algorithm returns - without considering further matches in the current file, - or any further files. 

-

The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. 

-

Returns the total number of matches found.

-

May throw an exception derived from std::runtime_error - if file io fails.

-
 
 unsigned int - FindFiles(FindFilesCallback cb, const char* - files, bool recurse = false, unsigned - int flags = match_default);Searches files to - find all those which contain at least one match of the - current expression using the match flags flags - - see match - flags. For each matching file calls the call-back - function cb. 

If the call-back returns false then - the algorithm returns without considering any further - files. 

-

The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. 

-

Returns the total number of files found.

-

May throw an exception derived from std::runtime_error - if file io fails.

-
 
 unsigned int - FindFiles(FindFilesCallback cb, const std::string& - files, bool recurse = false, unsigned - int flags = match_default);Searches files to - find all those which contain at least one match of the - current expression using the match flags flags - - see match - flags. For each matching file calls the call-back - function cb. 

If the call-back returns false then - the algorithm returns without considering any further - files. 

-

The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. 

-

Returns the total number of files found.

-

May throw an exception derived from std::runtime_error - if file io fails.

-
 
 std::string Merge(const - std::string& in, const std::string& fmt, bool - copy = true, unsigned int flags = - match_default);Performs a search and - replace operation: searches through the string in - for all occurrences of the current expression, for each - occurrence replaces the match with the format string fmt. - Uses flags to determine what gets matched, and how - the format string should be treated. If copy is - true then all unmatched sections of input are copied - unchanged to output, if the flag format_first_only - is set then only the first occurrence of the pattern found - is replaced. Returns the new string. See also format string - syntax, match - flags and format flags. 
 std::string Merge(const - char* in, const char* fmt, bool copy = true, - unsigned int flags = match_default);Performs a search and - replace operation: searches through the string in - for all occurrences of the current expression, for each - occurrence replaces the match with the format string fmt. - Uses flags to determine what gets matched, and how - the format string should be treated. If copy is - true then all unmatched sections of input are copied - unchanged to output, if the flag format_first_only - is set then only the first occurrence of the pattern found - is replaced. Returns the new string. See also format string - syntax, match - flags and format flags. 
 unsigned Split(std::vector<std::string>& - v, std::string& s, unsigned flags = - match_default, unsigned max_count = ~0);Splits the input string and pushes each - one onto the vector. If the expression contains no marked - sub-expressions, then one string is outputted for each - section of the input that does not match the expression. - If the expression does contain marked sub-expressions, - then outputs one string for each marked sub-expression - each time a match occurs. Outputs no more than max_count - strings. Before returning, deletes from the input - string s all of the input that has been processed - (all of the string if max_count was not reached). - Returns the number of strings pushed onto the vector. 
 unsigned int - Position(int i = 0)const;Returns the position of what - matched sub-expression i. If i = 0 then - returns the position of the whole match. Returns RegEx::npos - if the supplied index is invalid, or if the specified sub-expression - did not participate in the match. 
 unsigned int - Length(int i = 0)const;Returns the length of what - matched sub-expression i. If i = 0 then - returns the length of the whole match. Returns RegEx::npos - if the supplied index is invalid, or if the specified sub-expression - did not participate in the match. 
 bool Matched(int i - = 0)const;Returns true if sub-expression i was - matched, false otherwise. 
 unsigned int - Line()const;Returns the line on which - the match occurred, indexes start from 1 not zero, if no - match occurred then returns RegEx::npos. 
 unsigned int Marks() - const;Returns the number of marked - sub-expressions contained in the expression. Note that - this includes the whole match (sub-expression zero), so - the value returned is always >= 1. 
 std::string What(int - i)const;Returns a copy of what - matched sub-expression i. If i = 0 then - returns a copy of the whole match. Returns a null string - if the index is invalid or if the specified sub-expression - did not participate in a match. 
 std::string operator[](int - i)const ;Returns What(i);

Can - be used to simplify access to sub-expression matches, and - make usage more perl-like.
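The behaviour described for the vector overloads of Grep can be approximated in a few lines. The sketch below is not the RegEx class; it assumes the C++11 `std::regex` and `std::sregex_iterator`, and `grep_all` is a hypothetical free function that mirrors the documented contract (push a copy of each match, return the number of matches).

```cpp
#include <regex>
#include <string>
#include <vector>

// Pushes a copy of every match of `expression` in `text` onto `v` and
// returns the number of matches found.
unsigned int grep_all(std::vector<std::string>& v,
                      const std::string& expression,
                      const std::string& text)
{
   const std::regex e(expression);
   unsigned int count = 0;
   const std::sregex_iterator end;
   for(std::sregex_iterator i(text.begin(), text.end(), e); i != end; ++i)
   {
      v.push_back(i->str()); // whole match, i.e. sub-expression 0
      ++count;
   }
   return count;
}
```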

-
 
- -
- -

Copyright Dr -John Maddock 1998-2000 all rights reserved.

- - diff --git a/index.htm b/index.htm deleted file mode 100644 index f313dd7c..00000000 --- a/index.htm +++ /dev/null @@ -1,150 +0,0 @@ - - - - - - - -regex++, Index - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, Index.

-

(Version 3.31, 16th Dec 2001)  -

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

Contents

- - - -
- -

Copyright Dr -John Maddock 1998-2001 all rights reserved.

- - diff --git a/index.html b/index.html new file mode 100644 index 00000000..a1f01b7b --- /dev/null +++ b/index.html @@ -0,0 +1,9 @@ + + + + + + Automatic redirection failed, please go to doc/index.html. + + + diff --git a/introduction.htm b/introduction.htm deleted file mode 100644 index bcac99bb..00000000 --- a/introduction.htm +++ /dev/null @@ -1,476 +0,0 @@ - - - - - - - -regex++, Introduction - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, Introduction.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

Introduction

- -

Regular expressions are a form of pattern-matching that is -often used in text processing; many users will be familiar with -the Unix utilities grep, sed and awk, and -the programming language perl, each of which makes -extensive use of regular expressions. Traditionally C++ users -have been limited to the POSIX C APIs for manipulating regular -expressions, and while regex++ does provide these APIs, they do -not represent the best way to use the library. For example regex++ -can cope with wide character strings, or search and replace -operations (in a manner analogous to either sed or perl), -something that traditional C libraries cannot do.

- -

The class boost::reg_expression -is the key class in this library; it represents a "machine -readable" regular expression, and is very closely modelled -on std::basic_string, think of it as a string plus the actual -state-machine required by the regular expression algorithms. Like -std::basic_string there are two typedefs that are almost always -the means by which this class is referenced:

- -
namespace boost{
-
-template <class charT, 
-          class traits = regex_traits<charT>, 
-          class Allocator = std::allocator<charT> >
-class reg_expression;
-
-typedef reg_expression<char> regex;
-typedef reg_expression<wchar_t> wregex;
-
-}
- -

To see how this library can be used, imagine that we are -writing a credit card processing application. Credit card numbers -generally come as a string of 16-digits, separated into groups of -4-digits, and separated by either a space or a hyphen. Before -storing a credit card number in a database (not necessarily -something your customers will appreciate!), we may want to verify -that the number is in the correct format. To match any digit we -could use the regular expression [0-9], however ranges of -characters like this are actually locale dependent. Instead we -should use the POSIX standard form [[:digit:]], or the regex++ -and perl shorthand for this \d (note that many older libraries -tended to be hard-coded to the C-locale, consequently this was -not an issue for them). That leaves us with the following regular -expression to validate credit card number formats:

- -

(\d{4}[- ]){3}\d{4}

- -

Here the parentheses act to group (and mark for future -reference) sub-expressions, and the {4} means "repeat -exactly 4 times". This is an example of the extended regular -expression syntax used by perl, awk and egrep. Regex++ also -supports the older "basic" syntax used by sed and grep, -but this is generally less useful, unless you already have some -basic regular expressions that you need to reuse.

- -

Now let's take that expression and place it in some C++ code to -validate the format of a credit card number:

- -
bool validate_card_format(const std::string& s)
-{
-   static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
-   return regex_match(s, e);
-}
- -

Note how we had to add some extra escapes to the expression: -remember that the escape is seen once by the C++ compiler, before -it gets to be seen by the regular expression engine, consequently -escapes in regular expressions have to be doubled up when -embedding them in C/C++ code. Also note that all the examples -assume that your compiler supports Koenig lookup, if yours -doesn't (for example VC6), then you will have to add some boost:: -prefixes to some of the function calls in the examples.
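For readers without regex++ to hand, the same check can be sketched with the C++11 `std::regex`, which accepts this expression unchanged. The `_std` suffix on the function name is only there to mark it as a hypothetical standard-library variant, and the calls are fully qualified so that nothing depends on Koenig lookup.

```cpp
#include <regex>
#include <string>

// Returns true if `s` is four groups of four digits, with each of the
// first three groups followed by a single hyphen or space.
bool validate_card_format_std(const std::string& s)
{
   static const std::regex e("(\\d{4}[- ]){3}\\d{4}");
   return std::regex_match(s, e);
}
```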

- -

Those of you who are familiar with credit card processing, -will have realised that while the format used above is suitable -for human readable card numbers, it does not represent the format -required by online credit card systems; these require the number -as a string of 16 (or possibly 15) digits, without any -intervening spaces. What we need is a means to convert easily -between the two formats, and this is where search and replace -comes in. Those who are familiar with the utilities sed -and perl will already be ahead here; we need two strings - -one a regular expression - the other a "format string" that provides a -description of the text to replace the match with. In regex++ -this search and replace operation is performed with the algorithm -regex_merge, for our credit card example we can write two -algorithms like this to provide the format conversions:

- -
-// match any format with the regular expression:
-const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
-const std::string machine_format("\\1\\2\\3\\4");
-const std::string human_format("\\1-\\2-\\3-\\4");
-
-std::string machine_readable_card_number(const std::string& s)
-{
-   return regex_merge(s, e, machine_format, boost::match_default | boost::format_sed);
-}
-
-std::string human_readable_card_number(const std::string& s)
-{
-   return regex_merge(s, e, human_format, boost::match_default | boost::format_sed);
-}
- -

Here we've used marked sub-expressions in the regular -expression to split out the four parts of the card number as -separate fields, the format string then uses the sed-like syntax -to replace the matched text with the reformatted version.
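For readers without regex++ to hand, the same conversion can be sketched with the standard library. This is an illustration under stated assumptions, not the library's own code: std::regex_replace stands in for regex_merge, ^ and $ replace \A and \z (which the default std::regex grammar does not provide), and the function names are hypothetical. The sed-style format strings are unchanged.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical accessor for the shared expression; ^ and $ are used
// in place of \A and \z (assumption: single-line input).
const std::regex& card_regex()
{
   static const std::regex e("^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");
   return e;
}

// Strip separators: the four marked sub-expressions are emitted back to back.
std::string machine_readable_std(const std::string& s)
{
   return std::regex_replace(s, card_regex(), std::string("\\1\\2\\3\\4"),
      std::regex_constants::match_default | std::regex_constants::format_sed);
}

// Re-insert separators: the same four fields joined with hyphens.
std::string human_readable_std(const std::string& s)
{
   return std::regex_replace(s, card_regex(), std::string("\\1-\\2-\\3-\\4"),
      std::regex_constants::match_default | std::regex_constants::format_sed);
}
```

Input that does not match the expression is returned unchanged, because text outside a match is copied through to the output by default.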

- -

In the examples above, we haven't directly manipulated the -results of a regular expression match; however, in general the -result of a match contains a number of sub-expression matches in -addition to the overall match. When the library needs to report a -regular expression match it does so using an instance of the -class match_results; -as before, there are typedefs of this class for the most common -cases:

- -
namespace boost{
-typedef match_results<const char*> cmatch;
-typedef match_results<const wchar_t*> wcmatch;
-typedef match_results<std::string::const_iterator> smatch;
-typedef match_results<std::wstring::const_iterator> wsmatch; 
-}
- -

The algorithms regex_search -and regex_grep (i.e. -finding all matches in a string) make use of match_results to -report what matched.
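As a rough illustration of how a match result exposes its sub-expressions, the sketch below pulls the four fields out of a card number found anywhere in a string. It assumes std::regex and std::smatch as stand-ins for boost::regex and boost::smatch, and card_fields is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Returns the four digit groups of the first card number found in s,
// or an empty vector if there is no match.
std::vector<std::string> card_fields(const std::string& s)
{
   static const std::regex e("(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})");
   std::smatch what;
   std::vector<std::string> fields;
   if(std::regex_search(s, what, e))
   {
      // what[0] is the overall match; what[1]..what[4] are the sub-expressions.
      for(std::size_t i = 1; i < what.size(); ++i)
         fields.push_back(what[i].str());
   }
   return fields;
}
```

Index 0 of the result always refers to the whole match, which is why the loop starts at 1.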

- -

Note that these algorithms are not restricted to searching -regular C-strings, any bidirectional iterator type can be -searched, allowing for the possibility of seamlessly searching -almost any kind of data.

- -

For search and replace operations in addition to the algorithm -regex_merge that -we have already seen, the algorithm regex_format takes -the result of a match and a format string, and produces a new -string by merging the two.
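Merging a single match with a format string might look like the following sketch. It is an approximation under assumptions: the member function std::match_results::format stands in for regex_format, the format string uses the standard library's default $n syntax rather than the sed syntax shown earlier, and reorder_date is a hypothetical name.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Rewrites a d/m/yyyy style date as yyyy-m-d by merging the match
// with a format string; returns an empty string if s does not match.
std::string reorder_date(const std::string& s)
{
   static const std::regex e("(\\d{1,2})/(\\d{1,2})/(\\d{4})");
   std::smatch what;
   if(!std::regex_match(s, what, e))
      return std::string();
   // $1..$3 refer to the marked sub-expressions of the match.
   return what.format("$3-$2-$1");
}
```

Unlike a search and replace over the whole input, this operates on one match that has already been found, so the caller decides what to do when there is no match.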

- -

For those that dislike templates, there is a high level -wrapper class RegEx that is an encapsulation of the lower level -template code - it provides a simplified interface for those that -don't need the full power of the library, and supports only -narrow characters, and the "extended" regular -expression syntax.

- -

The POSIX API functions: -regcomp, regexec, regfree and regerror, are available in both -narrow character and Unicode versions, and are provided for those -who need compatibility with these APIs.
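The calling sequence for the POSIX functions is compile, execute, free. The sketch below assumes the system header <regex.h> found on POSIX platforms rather than the header shipped with regex++ (which is not shown here), and posix_validate_card_format is a hypothetical name. POSIX extended syntax has no \d, so [0-9] is used instead.

```cpp
#include <cassert>
#include <regex.h>
#include <string>

// Validates the human readable card format using the POSIX C API.
bool posix_validate_card_format(const std::string& s)
{
   regex_t re;
   // REG_EXTENDED selects the "extended" syntax; REG_NOSUB says we
   // only want a yes/no answer, not sub-expression positions.
   if(regcomp(&re, "^([0-9]{4}[- ]){3}[0-9]{4}$", REG_EXTENDED | REG_NOSUB) != 0)
      return false;
   const int rc = regexec(&re, s.c_str(), 0, nullptr, 0);
   regfree(&re); // always release the compiled expression
   return rc == 0;
}
```

Compiling on every call keeps the sketch short; code that validates many strings would normally compile once and reuse the regex_t.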

- -

Finally, note that the library now has run-time localization support, and -recognizes the full POSIX regular expression syntax - including -advanced features like multi-character collating elements and -equivalence classes - as well as providing compatibility with -other regular expression libraries including GNU and BSD4 regex -packages, and to a more limited extent perl 5.

- -

Installation and Configuration -Options

- -

[ Important: If you are -upgrading from the 2.x version of this library then you will find -a number of changes to the documented header names and library -interfaces, existing code should still compile unchanged however -- see Note -for Upgraders. ]

- -

When you extract the library from its zip file, you must -preserve its internal directory structure (for example by using -the -d option when extracting). If you didn't do that when -extracting, then you'd better stop reading this, delete the files -you just extracted, and try again!

- -

This library should not need configuring before use; most -popular compilers/standard libraries/platforms are already -supported "as is". If you do experience configuration -problems, or just want to test the configuration with your -compiler, then the process is the same as for all of boost; see -the configuration library -documentation.

- -

The library will encase all code inside namespace boost.

- -

Unlike some other template libraries, this library consists of -a mixture of template code (in the headers) and static code and -data (in cpp files). Consequently it is necessary to build the -library's support code into a library or archive file before you -can use it, instructions for specific platforms are as follows:

- -

Borland C++ Builder:

- -
    -
  • Open up a console window and change to the - <boost>\libs\regex\build directory.
  • -
  • Select the appropriate makefile (bcb4.mak for C++ Builder - 4, bcb5.mak for C++ Builder 5, and bcb6.mak for C++ - Builder 6).
  • -
  • Invoke the makefile (pass the full path to your version - of make if you have more than one version installed, the - makefile relies on the path to make to obtain your C++ - Builder installation directory and tools) for example:
  • -
- -
make -fbcb5.mak
- -

The build process will build a variety of .lib and .dll files -(the exact number depends upon the version of Borland's tools you -are using); the .lib and .dll files will be in a sub-directory -called bcb4 or bcb5 depending upon the makefile used. To install -the libraries into your development system use:

- -

make -fbcb5.mak install

- -

Library files will be copied to <BCROOT>/lib and the -dlls to <BCROOT>/bin, where <BCROOT> corresponds to -the install path of your Borland C++ tools.

- -

You may also remove temporary files created during the build -process (excluding lib and dll files) by using:

- -

make -fbcb5.mak clean

- -

Finally, when you use regex++ it is only necessary for you to -add the <boost> root directory to your list of include -directories for that project. It is not necessary for you to -manually add a .lib file to the project; the headers will -automatically select the correct .lib file for your build mode -and tell the linker to include it. There is one caveat however: -the library cannot tell the difference between VCL and non-VCL -enabled builds when building a GUI application from the command -line. If you build from the command line with the 5.5 command -line tools then you must define the pre-processor symbol _NO_VCL -in order to ensure that the correct link libraries are selected; -the C++ Builder IDE normally sets this automatically. Hint: users -of the 5.5 command line tools may want to add a -D_NO_VCL to bcc32.cfg -in order to set this option permanently.

- -

If you would prefer to do a static link to the regex libraries -even when using the dll runtime then define -BOOST_REGEX_STATIC_LINK, and if you want to suppress automatic -linking altogether (and supply your own custom build of the lib) -then define BOOST_REGEX_NO_LIB.

- -

If you are building with C++ Builder 6, you will find that -<boost/regex.hpp> cannot be used in a pre-compiled header -(the actual problem is in <locale> which gets included by -<boost/regex.hpp>). If this causes problems for you, then -try defining BOOST_NO_STD_LOCALE when building; this will disable -some features throughout boost, but may save you a lot in compile -times!

- -

Microsoft Visual C++ 6 and 7

- -

You need version 6 of MSVC to build this library. If you are -using VC5 then you may want to look at one of the previous -releases of this library. -

- -

Open up a command prompt, which has the necessary MSVC -environment variables defined (for example by using the batch -file Vcvars32.bat installed by the Visual Studio installation), -and change to the <boost>\libs\regex\build directory.

- -

Select the correct makefile - vc6.mak for "vanilla" -Visual C++ 6 or vc6-stlport.mak if you are using STLPort.

- -

Invoke the makefile like this:

- -

nmake -fvc6.mak

- -

You will now have a collection of lib and dll files in a -"vc6" subdirectory, to install these into your -development system use:

- -

nmake -fvc6.mak install

- -

The lib files will be copied to your <VC6>\lib directory -and the dll files to <VC6>\bin, where <VC6> is the -root of your Visual C++ 6 installation.

- -

You can delete all the temporary files created during the -build (excluding lib and dll files) using:

- -

nmake -fvc6.mak clean

- -

Finally when you use regex++ it is only necessary for you to -add the <boost> root directory to your list of include -directories for that project. It is not necessary for you to -manually add a .lib file to the project; the headers will -automatically select the correct .lib file for your build mode -and tell the linker to include it.

- -

Note that if you want to statically link to the regex library -when using the dynamic C++ runtime, define -BOOST_REGEX_STATIC_LINK when building your project (this only has -an effect for release builds). If you want to add the source -directly to your project then define BOOST_REGEX_NO_LIB to -disable automatic library selection.

- -

Important: there have been some -reports of compiler-optimisation bugs affecting this library (particularly -with VC6 versions prior to service pack 5); the workaround is to -build the library using /Oityb1 rather than /O2, that is, to use -all optimisation settings except /Oa. This problem is reported to -affect some standard library code as well (in fact I'm not sure -if the problem is with the regex code or the underlying standard -library), so it's probably worthwhile applying this workaround in -normal practice in any case.

- -

Note: if you have replaced the C++ standard library that comes -with VC6, then when you build the library you must ensure that -the environment variables "INCLUDE" and "LIB" -have been updated to reflect the include and library paths for -the new library - see vcvars32.bat (part of your Visual Studio -installation) for more details. Alternatively if STLPort is in c:/stlport -then you could use:

- -

nmake INCLUDES="-Ic:/stlport/stlport" XLFLAGS="/LIBPATH:c:/stlport/lib" --fvc6-stlport.mak

- -

If you are building with the full STLPort v4.x, then use the -vc6-stlport.mak file provided and set the environment variable -STLPORT_PATH to point to the location of your STLport -installation (Note that the full STLPort libraries appear not to -support single-thread static builds).

- -

GCC(2.95)

- -

There is a conservative makefile for the g++ compiler. From -the command prompt change to the <boost>/libs/regex/build -directory and type:

- -

make -fgcc.mak

- -

At the end of the build process you should have a gcc sub-directory -containing release and debug versions of the library (libboost_regex.a -and libboost_regex_debug.a). When you build projects that use -regex++, you will need to add the boost install directory to your -list of include paths and add <boost>/libs/regex/build/gcc/libboost_regex.a -to your list of library files.

- -

There is also a makefile to build the library as a shared -library:

- -

make -fgcc-shared.mak

- -

which will build libboost_regex.so and libboost_regex_debug.so.

- -

Both of these makefiles support the following environment -variables:

- -

CXXFLAGS: extra compiler options - note that this applies to -both the debug and release builds.

- -

INCLUDES: additional include directories.

- -

LDFLAGS: additional linker options.

- -

LIBS: additional library files.

- -

For the more adventurous there is a configure script in -<boost>/libs/config; see the config -library documentation.

- -

Sun Workshop 6.1

- -

There is a makefile for the Sun (6.1) compiler (C++ version 3.12). -From the command prompt change to the <boost>/libs/regex/build -directory and type:

- -

dmake -f sunpro.mak

- -

At the end of the build process you should have a sunpro sub-directory -containing single and multithread versions of the library (libboost_regex.a, -libboost_regex.so, libboost_regex_mt.a and libboost_regex_mt.so). -When you build projects that use regex++, you will need to add -the boost install directory to your list of include paths and add -<boost>/libs/regex/build/sunpro/ to your library search -path.

- -

This makefile supports the following environment -variables:

- -

CXXFLAGS: extra compiler options - note that this applies to -both the single and multithreaded builds.

- -

INCLUDES: additional include directories.

- -

LDFLAGS: additional linker options.

- -

LIBS: additional library files.

- -

LIBSUFFIX: a suffix to mangle the library name with (defaults -to nothing).

- -

This makefile does not set any architecture specific options -like -xarch=v9, you can set these by defining the appropriate -macros, for example:

- -

dmake CXXFLAGS="-xarch=v9" LDFLAGS="-xarch=v9" -LIBSUFFIX="_v9" -f sunpro.mak

- -

will build v9 variants of the regex library named -libboost_regex_v9.a etc.

- -

Other compilers:

- -

There is a generic makefile (generic.mak) -provided in <boost-root>/libs/regex/build - see that -makefile for details of environment variables that need to be set -before use. Alternatively you can use the Jam based build system. -If you need to configure the library for your platform, then -refer to the config library -documentation.

- -
- -

Copyright Dr -John Maddock 1998-2001 all rights reserved.

- - diff --git a/performance/Jamfile b/performance/Jamfile new file mode 100644 index 00000000..d3a58ee6 --- /dev/null +++ b/performance/Jamfile @@ -0,0 +1,43 @@ + +subproject libs/regex/performance ; + +SOURCES = command_line main time_boost time_greta time_localised_boost time_pcre time_posix time_safe_greta ; + +if $(HS_REGEX_PATH) +{ + HS_SOURCES = $(HS_REGEX_PATH)/regcomp.c $(HS_REGEX_PATH)/regerror.c $(HS_REGEX_PATH)/regexec.c $(HS_REGEX_PATH)/regfree.c ; + POSIX_OPTS = BOOST_HAS_POSIX=1 $(HS_REGEX_PATH) ; +} +else if $(USE_POSIX) +{ + POSIX_OPTS = BOOST_HAS_POSIX=1 ; +} + +if $(PCRE_PATH) +{ + PCRE_SOURCES = $(PCRE_PATH)/chartables.c $(PCRE_PATH)/get.c $(PCRE_PATH)/pcre.c $(PCRE_PATH)/study.c ; + PCRE_OPTS = BOOST_HAS_PCRE=1 $(PCRE_PATH) ; +} +else if $(USE_PCRE) +{ + PCRE_OPTS = BOOST_HAS_PCRE=1 pcre ; +} + + +exe regex_comparison : + $(SOURCES).cpp + $(HS_SOURCES) + $(PCRE_SOURCES) + ../build/boost_regex + ../../test/build/boost_prg_exec_monitor + : + $(BOOST_ROOT) + BOOST_REGEX_NO_LIB=1 + BOOST_REGEX_STATIC_LINK=1 + $(POSIX_OPTS) + $(PCRE_OPTS) + ; + + + + diff --git a/performance/command_line.cpp b/performance/command_line.cpp new file mode 100644 index 00000000..b74143c3 --- /dev/null +++ b/performance/command_line.cpp @@ -0,0 +1,470 @@ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "regex_comparison.hpp" + +#ifdef BOOST_HAS_PCRE +#include "pcre.h" // for pcre version number +#endif + +// +// globals: +// +bool time_boost = false; +bool time_localised_boost = false; +bool time_greta = false; +bool time_safe_greta = false; +bool time_posix = false; +bool time_pcre = false; + +bool test_matches = false; +bool test_code = false; +bool test_html = false; +bool test_short_twain = false; +bool test_long_twain = false; + + +std::string html_template_file; +std::string html_out_file; +std::string html_contents; +std::list result_list; + +// the following let us compute averages: +double greta_total = 0; 
+double safe_greta_total = 0; +double boost_total = 0; +double locale_boost_total = 0; +double posix_total = 0; +double pcre_total = 0; +unsigned greta_test_count = 0; +unsigned safe_greta_test_count = 0; +unsigned boost_test_count = 0; +unsigned locale_boost_test_count = 0; +unsigned posix_test_count = 0; +unsigned pcre_test_count = 0; + +int handle_argument(const std::string& what) +{ + if(what == "-b") + time_boost = true; + else if(what == "-bl") + time_localised_boost = true; +#ifdef BOOST_HAS_GRETA + else if(what == "-g") + time_greta = true; + else if(what == "-gs") + time_safe_greta = true; +#endif +#ifdef BOOST_HAS_POSIX + else if(what == "-posix") + time_posix = true; +#endif +#ifdef BOOST_HAS_PCRE + else if(what == "-pcre") + time_pcre = true; +#endif + else if(what == "-all") + { + time_boost = true; + time_localised_boost = true; +#ifdef BOOST_HAS_GRETA + time_greta = true; + time_safe_greta = true; +#endif +#ifdef BOOST_HAS_POSIX + time_posix = true; +#endif +#ifdef BOOST_HAS_PCRE + time_pcre = true; +#endif + } + else if(what == "-test-matches") + test_matches = true; + else if(what == "-test-code") + test_code = true; + else if(what == "-test-html") + test_html = true; + else if(what == "-test-short-twain") + test_short_twain = true; + else if(what == "-test-long-twain") + test_long_twain = true; + else if(what == "-test-all") + { + test_matches = true; + test_code = true; + test_html = true; + test_short_twain = true; + test_long_twain = true; + } + else if((what == "-h") || (what == "--help")) + return show_usage(); + else if((what[0] == '-') || (what[0] == '/')) + { + std::cerr << "Unknown argument: \"" << what << "\"" << std::endl; + return 1; + } + else if(html_template_file.size() == 0) + { + html_template_file = what; + load_file(html_contents, what.c_str()); + } + else if(html_out_file.size() == 0) + html_out_file = what; + else + { + std::cerr << "Unexpected argument: \"" << what << "\"" << std::endl; + return 1; + } + return 0; +} + +int 
show_usage() +{ + std::cout << + "Usage\n" + "regex_comparison [-h] [library options] [test options] [html_template html_output_file]\n" + " -h Show help\n\n" + " library options:\n" + " -b Apply tests to boost library\n" + " -bl Apply tests to boost library with C++ locale\n" +#ifdef BOOST_HAS_GRETA + " -g Apply tests to GRETA library\n" + " -gs Apply tests to GRETA library (in non-recursive mode)\n" +#endif +#ifdef BOOST_HAS_POSIX + " -posix Apply tests to POSIX library\n" +#endif +#ifdef BOOST_HAS_PCRE + " -pcre Apply tests to PCRE library\n" +#endif + " -all Apply tests to all libraries\n\n" + " test options:\n" + " -test-matches Test short matches\n" + " -test-code Test c++ code examples\n" + " -test-html Test c++ code examples\n" + " -test-short-twain Test short searches\n" + " -test-long-twain Test long searches\n" + " -test-all Test everthing\n"; + return 1; +} + +void load_file(std::string& text, const char* file) +{ + std::deque temp_copy; + std::ifstream is(file); + if(!is.good()) + { + std::string msg("Unable to open file: \""); + msg.append(file); + msg.append("\""); + throw std::runtime_error(msg); + } + is.seekg(0, std::ios_base::end); + std::istream::pos_type pos = is.tellg(); + is.seekg(0, std::ios_base::beg); + text.erase(); + text.reserve(pos); + std::istreambuf_iterator it(is); + std::copy(it, std::istreambuf_iterator(), std::back_inserter(text)); +} + +void print_result(std::ostream& os, double time, double best) +{ + static const char* suffixes[] = {"s", "ms", "us", "ns", "ps", }; + + if(time < 0) + { + os << "NA"; + return; + } + double rel = time / best; + bool highlight = ((rel > 0) && (rel < 1.1)); + unsigned suffix = 0; + while(time < 0) + { + time *= 1000; + ++suffix; + } + os << ""; + if(highlight) + os << ""; + if(rel <= 1000) + os << std::setprecision(3) << rel; + else + os << (int)rel; + os << "
("; + if(time <= 1000) + os << std::setprecision(3) << time; + else + os << (int)time; + os << suffixes[suffix] << ")"; + if(highlight) + os << "
"; + os << ""; +} + +std::string html_quote(const std::string& in) +{ + static const boost::regex e("(<)|(>)|(&)|(\")"); + static const std::string format("(?1<)(?2>)(?3&)(?4")"); + return regex_replace(in, e, format, boost::match_default | boost::format_all); +} + +void output_html_results(bool show_description, const std::string& tagname) +{ + std::stringstream os; + if(result_list.size()) + { + // + // start by outputting the table header: + // + os << "\n"; + os << ""; + if(show_description) + os << ""; +#if defined(BOOST_HAS_GRETA) + if(time_greta == true) + os << ""; + if(time_safe_greta == true) + os << ""; +#endif + if(time_boost == true) + os << ""; + if(time_localised_boost == true) + os << ""; +#if defined(BOOST_HAS_POSIX) + if(time_posix == true) + os << ""; +#endif +#ifdef BOOST_HAS_PCRE + if(time_pcre == true) + os << ""; +#endif + os << "\n"; + + // + // Now enumerate through all the test results: + // + std::list::const_iterator first, last; + first = result_list.begin(); + last = result_list.end(); + while(first != last) + { + os << ""; + if(show_description) + os << ""; +#if defined(BOOST_HAS_GRETA) + if(time_greta == true) + { + print_result(os, first->greta_time, first->factor); + if(first->greta_time > 0) + { + greta_total += first->greta_time / first->factor; + ++greta_test_count; + } + } + if(time_safe_greta == true) + { + print_result(os, first->safe_greta_time, first->factor); + if(first->safe_greta_time > 0) + { + safe_greta_total += first->safe_greta_time / first->factor; + ++safe_greta_test_count; + } + } +#endif +#if defined(BOOST_HAS_POSIX) + if(time_boost == true) + { + print_result(os, first->boost_time, first->factor); + if(first->boost_time > 0) + { + boost_total += first->boost_time / first->factor; + ++boost_test_count; + } + } + if(time_localised_boost == true) + { + print_result(os, first->localised_boost_time, first->factor); + if(first->localised_boost_time > 0) + { + locale_boost_total += first->localised_boost_time / 
first->factor; + ++locale_boost_test_count; + } + } +#endif + if(time_posix == true) + { + print_result(os, first->posix_time, first->factor); + if(first->posix_time > 0) + { + posix_total += first->posix_time / first->factor; + ++posix_test_count; + } + } +#if defined(BOOST_HAS_PCRE) + if(time_pcre == true) + { + print_result(os, first->pcre_time, first->factor); + if(first->pcre_time > 0) + { + pcre_total += first->pcre_time / first->factor; + ++pcre_test_count; + } + } +#endif + os << "\n"; + ++first; + } + os << "
ExpressionTextGRETAGRETA
(non-recursive mode)
BoostBoost + C++ localePOSIXPCRE
" << html_quote(first->expression) << "" << html_quote(first->description) << "
\n"; + result_list.clear(); + } + else + { + os << "

Results not available...

\n"; + } + + std::string result = os.str(); + + std::string::size_type pos = html_contents.find(tagname); + if(pos != std::string::npos) + { + html_contents.replace(pos, tagname.size(), result); + } +} + +std::string get_boost_version() +{ + std::stringstream os; + os << (BOOST_VERSION / 100000) << '.' << ((BOOST_VERSION / 100) % 1000) << '.' << (BOOST_VERSION % 100); + return os.str(); +} + +std::string get_averages_table() +{ + std::stringstream os; + // + // start by outputting the table header: + // + os << "\n"; + os << ""; +#if defined(BOOST_HAS_GRETA) + if(time_greta == true) + { + os << ""; + } + if(time_safe_greta == true) + { + os << ""; + } + +#endif + if(time_boost == true) + { + os << ""; + } + if(time_localised_boost == true) + { + os << ""; + } +#if defined(BOOST_HAS_POSIX) + if(time_posix == true) + { + os << ""; + } +#endif +#ifdef BOOST_HAS_PCRE + if(time_pcre == true) + { + os << ""; + } +#endif + os << "\n"; + + // + // Now enumerate through all averages: + // + os << ""; +#if defined(BOOST_HAS_GRETA) + if(time_greta == true) + os << "\n"; + if(time_safe_greta == true) + os << "\n"; +#endif +#if defined(BOOST_HAS_POSIX) + if(time_boost == true) + os << "\n"; + if(time_localised_boost == true) + os << "\n"; +#endif + if(time_posix == true) + os << "\n"; +#if defined(BOOST_HAS_PCRE) + if(time_pcre == true) + os << "\n"; +#endif + os << "\n"; + os << "
GRETAGRETA
(non-recursive mode)
BoostBoost + C++ localePOSIXPCRE
" << (greta_total / greta_test_count) << "" << (safe_greta_total / safe_greta_test_count) << "" << (boost_total / boost_test_count) << "" << (locale_boost_total / locale_boost_test_count) << "" << (posix_total / posix_test_count) << "" << (pcre_total / pcre_test_count) << "
\n"; + return os.str(); +} + +void output_final_html() +{ + if(html_out_file.size()) + { + // + // start with search and replace ops: + // + std::string::size_type pos; + pos = html_contents.find("%compiler%"); + if(pos != std::string::npos) + { + html_contents.replace(pos, 10, BOOST_COMPILER); + } + pos = html_contents.find("%library%"); + if(pos != std::string::npos) + { + html_contents.replace(pos, 9, BOOST_STDLIB); + } + pos = html_contents.find("%os%"); + if(pos != std::string::npos) + { + html_contents.replace(pos, 4, BOOST_PLATFORM); + } + pos = html_contents.find("%boost%"); + if(pos != std::string::npos) + { + html_contents.replace(pos, 7, get_boost_version()); + } + pos = html_contents.find("%pcre%"); + if(pos != std::string::npos) + { +#ifdef PCRE_MINOR + html_contents.replace(pos, 6, BOOST_STRINGIZE(PCRE_MAJOR.PCRE_MINOR)); +#else + html_contents.replace(pos, 6, "N/A"); +#endif + } + pos = html_contents.find("%averages%"); + if(pos != std::string::npos) + { + html_contents.replace(pos, 10, get_averages_table()); + } + // + // now right the output to file: + // + std::ofstream os(html_out_file.c_str()); + os << html_contents; + } + else + { + std::cout << html_contents; + } +} \ No newline at end of file diff --git a/performance/input.html b/performance/input.html new file mode 100644 index 00000000..85ca5dba --- /dev/null +++ b/performance/input.html @@ -0,0 +1,70 @@ + + + Regular Expression Performance Comparison + + + + + + +

Regular Expression Performance Comparison

+

+ The following tables provide comparisons between the following regular + expression libraries:

+

GRETA.

+

The Boost regex library.

+

Henry Spencer's regular expression library + - this is provided for comparison as a typical non-backtracking implementation.

+

Philip Hazel's PCRE library.

+

Details

+

Machine: Intel Pentium 4 2.8GHz PC.

+

Compiler: %compiler%.

+

C++ Standard Library: %library%.

+

OS: %os%.

+

Boost version: %boost%.

+

PCRE version: %pcre%.

+

+ As ever, care should be taken in interpreting the results: only sensible regular + expressions (rather than pathological cases) are given, and most are taken from the + Boost regex examples, or from the Library of + Regular Expressions. In addition, some variation in the relative + performance of these libraries can be expected on other machines - as memory + access and processor caching effects can be quite large for most finite state + machine algorithms.

+

Averages

+

The following are the average relative scores for all the tests: the perfect + regular expression library would score 1, in practice anything less than 2 + is pretty good.
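The averaging described here can be sketched as follows. This is an illustration only, not the code in command_line.cpp: it assumes that each test records the fastest time achieved by any library (passed to print_result as best), that a negative time means the library was not run, and average_relative_score is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// times[i]: one library's time for test i (negative means "not available").
// best[i]:  the fastest time any library achieved on test i (assumed positive).
// Returns the mean of times[i] / best[i] over the available tests,
// or -1 if the library has no results at all.
double average_relative_score(const std::vector<double>& times,
                              const std::vector<double>& best)
{
   double total = 0;
   unsigned count = 0;
   for(std::size_t i = 0; i < times.size() && i < best.size(); ++i)
   {
      if(times[i] > 0)
      {
         total += times[i] / best[i];
         ++count;
      }
   }
   return count ? total / count : -1.0;
}
```

A library that is fastest on every test scores exactly 1; returning -1 for a library with no results avoids dividing by a zero test count.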

+

%averages%

+

Comparison 1: Long Search

+

For each of the following regular expressions the time taken to find all + occurrences of the expression within a long English language text was measured + (mtent12.txt + from Project Gutenberg, 19Mb). 

+

%long_twain_search%

+

Comparison 2: Medium Sized Search

+

For each of the following regular expressions the time taken to find all + occurrences of the expression within a medium sized English language text was + measured (the first 50K from mtent12.txt). 

+

%short_twain_search%

+

Comparison 3: C++ Code Search

+

For each of the following regular expressions the time taken to find all + occurrences of the expression within the C++ source file + boost/crc.hpp was measured. 

+

%code_search%

+

+

Comparison 4: HTML Document Search

+ +

For each of the following regular expressions the time taken to find all + occurrences of the expression within the html file libs/libraries.htm + was measured. 

+

%html_search%

+

Comparison 5: Simple Matches

+

+ For each of the following regular expressions the time taken to match against + the text indicated was measured. 

+

%short_matches%

+
+

Copyright John Maddock April 2003, all rights reserved.

+ + diff --git a/performance/main.cpp b/performance/main.cpp new file mode 100644 index 00000000..96ecbaf8 --- /dev/null +++ b/performance/main.cpp @@ -0,0 +1,251 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + +#include +#include +#include +#include +#include +#include "regex_comparison.hpp" + + +void test_match(const std::string& re, const std::string& text, const std::string& description, bool icase) +{ + double time; + results r(re, description); + + std::cout << "Testing: \"" << re << "\" against \"" << description << "\"" << std::endl; + +#ifdef BOOST_HAS_GRETA + if(time_greta == true) + { + time = g::time_match(re, text, icase); + r.greta_time = time; + std::cout << "\tGRETA regex: " << time << "s\n"; + } + if(time_safe_greta == true) + { + time = gs::time_match(re, text, icase); + r.safe_greta_time = time; + std::cout << "\tSafe GRETA regex: " << time << "s\n"; + } +#endif + if(time_boost == true) + { + time = b::time_match(re, text, icase); + r.boost_time = time; + std::cout << "\tBoost regex: " << time << "s\n"; + } + if(time_localised_boost == true) + { + time = bl::time_match(re, text, icase); + r.localised_boost_time = time; + std::cout << "\tBoost regex (C++ locale): " << time << "s\n"; + } +#ifdef BOOST_HAS_POSIX + if(time_posix == true) + { + time = posix::time_match(re, text, icase); + r.posix_time = time; + std::cout << "\tPOSIX regex: " << time << "s\n"; + } +#endif +#ifdef BOOST_HAS_PCRE + if(time_pcre == true) + { + time = pcr::time_match(re, text, 
icase); + r.pcre_time = time; + std::cout << "\tPCRE regex: " << time << "s\n"; + } +#endif + r.finalise(); + result_list.push_back(r); +} + +void test_find_all(const std::string& re, const std::string& text, const std::string& description, bool icase) +{ + std::cout << "Testing: " << re << std::endl; + + double time; + results r(re, description); + +#ifdef BOOST_HAS_GRETA + if(time_greta == true) + { + time = g::time_find_all(re, text, icase); + r.greta_time = time; + std::cout << "\tGRETA regex: " << time << "s\n"; + } + if(time_safe_greta == true) + { + time = gs::time_find_all(re, text, icase); + r.safe_greta_time = time; + std::cout << "\tSafe GRETA regex: " << time << "s\n"; + } +#endif + if(time_boost == true) + { + time = b::time_find_all(re, text, icase); + r.boost_time = time; + std::cout << "\tBoost regex: " << time << "s\n"; + } + if(time_localised_boost == true) + { + time = bl::time_find_all(re, text, icase); + r.localised_boost_time = time; + std::cout << "\tBoost regex (C++ locale): " << time << "s\n"; + } +#ifdef BOOST_HAS_POSIX + if(time_posix == true) + { + time = posix::time_find_all(re, text, icase); + r.posix_time = time; + std::cout << "\tPOSIX regex: " << time << "s\n"; + } +#endif +#ifdef BOOST_HAS_PCRE + if(time_pcre == true) + { + time = pcr::time_find_all(re, text, icase); + r.pcre_time = time; + std::cout << "\tPCRE regex: " << time << "s\n"; + } +#endif + r.finalise(); + result_list.push_back(r); +} + +int cpp_main(int argc, char * argv[]) +{ + // start by processing the command line args: + if(argc < 2) + return show_usage(); + int result = 0; + for(int c = 1; c < argc; ++c) + { + result += handle_argument(argv[c]); + } + if(result) + return result; + + if(test_matches) + { + // start with a simple test, this is basically a measure of the minimal overhead + // involved in calling a regex matcher: + test_match("abc", "abc"); + // these are from the regex docs: + test_match("^([0-9]+)(\\-| |$)(.*)$", "100- this is a line of ftp response 
which contains a message string"); + test_match("([[:digit:]]{4}[- ]){3}[[:digit:]]{3,4}", "1234-5678-1234-456"); + // these are from http://www.regxlib.com/ + test_match("^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$", "john_maddock@compuserve.com"); + test_match("^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$", "foo12@foo.edu"); + test_match("^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$", "bob.smith@foo.tv"); + test_match("^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$", "EH10 2QQ"); + test_match("^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$", "G1 1AA"); + test_match("^[a-zA-Z]{1,2}[0-9][0-9A-Za-z]{0,1} {0,1}[0-9][A-Za-z]{2}$", "SW1 1ZZ"); + test_match("^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$", "4/1/2001"); + test_match("^[[:digit:]]{1,2}/[[:digit:]]{1,2}/[[:digit:]]{4}$", "12/12/2001"); + test_match("^[-+]?[[:digit:]]*\\.?[[:digit:]]*$", "123"); + test_match("^[-+]?[[:digit:]]*\\.?[[:digit:]]*$", "+3.14159"); + test_match("^[-+]?[[:digit:]]*\\.?[[:digit:]]*$", "-3.14159"); + } + output_html_results(true, "%short_matches%"); + + std::string file_contents; + + if(test_code) + { + load_file(file_contents, "../../../boost/crc.hpp"); + + const char* highlight_expression = // preprocessor directives: index 1 + "(^[ \t]*#(?:[^\\\\\\n]|\\\\[^\\n_[:punct:][:alnum:]]*[\\n[:punct:][:word:]])*)|" + // comment: index 2 + "(//[^\\n]*|/\\*.*?\\*/)|" + // literals: index 3 + "\\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\>|" + // string literals: index 4 + "('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|" + // keywords: index 5 + "\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import" + 
"|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall" + "|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool" + "|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete" + "|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto" + "|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected" + "|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast" + "|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned" + "|using|virtual|void|volatile|wchar_t|while)\\>" + ; + + const char* class_expression = "^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?" + "(class|struct)[[:space:]]*(\\<\\w+\\>([ \t]*\\([^)]*\\))?" + "[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?" + "(\\{|:[^;\\{()]*\\{)"; + + const char* include_expression = "^[ \t]*#[ \t]*include[ \t]+(\"[^\"]+\"|<[^>]+>)"; + const char* boost_include_expression = "^[ \t]*#[ \t]*include[ \t]+(\"boost/[^\"]+\"|]+>)"; + + + test_find_all(class_expression, file_contents); + test_find_all(highlight_expression, file_contents); + test_find_all(include_expression, file_contents); + test_find_all(boost_include_expression, file_contents); + } + output_html_results(false, "%code_search%"); + + if(test_html) + { + load_file(file_contents, "../../../libs/libraries.htm"); + test_find_all("beman|john|dave", file_contents, true); + test_find_all("

.*?

", file_contents, true); + test_find_all("]+href=(\"[^\"]*\"|[^[:space:]]+)[^>]*>", file_contents, true); + test_find_all("]*>.*?", file_contents, true); + test_find_all("]+src=(\"[^\"]*\"|[^[:space:]]+)[^>]*>", file_contents, true); + test_find_all("]+face=(\"[^\"]*\"|[^[:space:]]+)[^>]*>.*?", file_contents, true); + } + output_html_results(false, "%html_search%"); + + if(test_short_twain) + { + load_file(file_contents, "short_twain.txt"); + + test_find_all("Twain", file_contents); + test_find_all("Huck[[:alpha:]]+", file_contents); + test_find_all("[[:alpha:]]+ing", file_contents); + test_find_all("^[^\n]*?Twain", file_contents); + test_find_all("Tom|Sawyer|Huckleberry|Finn", file_contents); + test_find_all("(Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn)", file_contents); + } + output_html_results(false, "%short_twain_search%"); + + if(test_long_twain) + { + load_file(file_contents, "mtent13.txt"); + + test_find_all("Twain", file_contents); + test_find_all("Huck[[:alpha:]]+", file_contents); + test_find_all("[[:alpha:]]+ing", file_contents); + test_find_all("^[^\n]*?Twain", file_contents); + test_find_all("Tom|Sawyer|Huckleberry|Finn", file_contents); + time_posix = false; + test_find_all("(Tom|Sawyer|Huckleberry|Finn).{0,30}river|river.{0,30}(Tom|Sawyer|Huckleberry|Finn)", file_contents); + time_posix = true; + } + output_html_results(false, "%long_twain_search%"); + + output_final_html(); + return 0; +} + diff --git a/performance/regex_comparison.hpp b/performance/regex_comparison.hpp new file mode 100644 index 00000000..0a695e3b --- /dev/null +++ b/performance/regex_comparison.hpp @@ -0,0 +1,136 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * All rights reserved. + * May not be transfered or disclosed to a third party without + * prior consent of the author. 
+ * + */ + + +#ifndef REGEX_COMPARISON_HPP +#define REGEX_COMPARISON_HPP + +#include +#include +#include + +// +// globals: +// +extern bool time_boost; +extern bool time_localised_boost; +extern bool time_greta; +extern bool time_safe_greta; +extern bool time_posix; +extern bool time_pcre; + +extern bool test_matches; +extern bool test_short_twain; +extern bool test_long_twain; +extern bool test_code; +extern bool test_html; + +extern std::string html_template_file; +extern std::string html_out_file; +extern std::string html_contents; + + +int handle_argument(const std::string& what); +int show_usage(); +void load_file(std::string& text, const char* file); +void output_html_results(bool show_description, const std::string& tagname); +void output_final_html(); + + +struct results +{ + double boost_time; + double localised_boost_time; + double greta_time; + double safe_greta_time; + double posix_time; + double pcre_time; + double factor; + std::string expression; + std::string description; + results(const std::string& ex, const std::string& desc) + : boost_time(-1), + localised_boost_time(-1), + greta_time(-1), + safe_greta_time(-1), + posix_time(-1), + pcre_time(-1), + factor(std::numeric_limits::max()), + expression(ex), + description(desc) + {} + void finalise() + { + if((boost_time >= 0) && (boost_time < factor)) + factor = boost_time; + if((localised_boost_time >= 0) && (localised_boost_time < factor)) + factor = localised_boost_time; + if((greta_time >= 0) && (greta_time < factor)) + factor = greta_time; + if((safe_greta_time >= 0) && (safe_greta_time < factor)) + factor = safe_greta_time; + if((posix_time >= 0) && (posix_time < factor)) + factor = posix_time; + if((pcre_time >= 0) && (pcre_time < factor)) + factor = pcre_time; + } +}; + +extern std::list result_list; + + +namespace b { +// boost tests: +double time_match(const std::string& re, const std::string& text, bool icase); +double time_find_all(const std::string& re, const std::string& text, bool 
icase); + +} +namespace bl { +// localised boost tests: +double time_match(const std::string& re, const std::string& text, bool icase); +double time_find_all(const std::string& re, const std::string& text, bool icase); + +} +namespace pcr { +// pcre tests: +double time_match(const std::string& re, const std::string& text, bool icase); +double time_find_all(const std::string& re, const std::string& text, bool icase); + +} +namespace g { +// greta tests: +double time_match(const std::string& re, const std::string& text, bool icase); +double time_find_all(const std::string& re, const std::string& text, bool icase); + +} +namespace gs { +// safe greta tests: +double time_match(const std::string& re, const std::string& text, bool icase); +double time_find_all(const std::string& re, const std::string& text, bool icase); + +} +namespace posix { +// posix tests: +double time_match(const std::string& re, const std::string& text, bool icase); +double time_find_all(const std::string& re, const std::string& text, bool icase); + +} +void test_match(const std::string& re, const std::string& text, const std::string& description, bool icase = false); +void test_find_all(const std::string& re, const std::string& text, const std::string& description, bool icase = false); +inline void test_match(const std::string& re, const std::string& text, bool icase = false) +{ test_match(re, text, text, icase); } +inline void test_find_all(const std::string& re, const std::string& text, bool icase = false) +{ test_find_all(re, text, "", icase); } + + +#define REPEAT_COUNT 10 + +#endif diff --git a/performance/time_boost.cpp b/performance/time_boost.cpp new file mode 100644 index 00000000..9dc3e791 --- /dev/null +++ b/performance/time_boost.cpp @@ -0,0 +1,98 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above 
copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + +#include "regex_comparison.hpp" +#include +#include + +namespace b{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + boost::regex e(re, (icase ? boost::regex::perl | boost::regex::icase : boost::regex::perl)); + boost::smatch what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_match(text, what, e); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_match(text, what, e); + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +bool dummy_grep_proc(const boost::smatch&) +{ return true; } + +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + boost::regex e(re, (icase ? 
boost::regex::perl | boost::regex::icase : boost::regex::perl)); + boost::smatch what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_grep(&dummy_grep_proc, text, e); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + if(result >10) + return result / iter; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_grep(&dummy_grep_proc, text, e); + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +} diff --git a/performance/time_greta.cpp b/performance/time_greta.cpp new file mode 100644 index 00000000..f6e4b309 --- /dev/null +++ b/performance/time_greta.cpp @@ -0,0 +1,125 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + +#include "regex_comparison.hpp" +#if defined(BOOST_HAS_GRETA) +#include +#include +#include "regexpr2.h" + +namespace g{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + regex::rpattern e(re, (icase ? 
regex::MULTILINE | regex::NORMALIZE | regex::NOCASE : regex::MULTILINE | regex::NORMALIZE)); + regex::match_results what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + assert(e.match(text, what)); + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text, what); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text, what); + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + regex::rpattern e(re, (icase ? regex::MULTILINE | regex::NORMALIZE | regex::NOCASE : regex::MULTILINE | regex::NORMALIZE)); + regex::match_results what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text.begin(), text.end(), what); + while(what.backref(0).matched) + { + e.match(what.backref(0).end(), text.end(), what); + } + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + if(result > 10) + return result / iter; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text.begin(), text.end(), what); + while(what.backref(0).matched) + { + e.match(what.backref(0).end(), text.end(), what); + } + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +} + +#else + +namespace g { + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + return -1; +} + +double time_find_all(const std::string& re, const 
std::string& text, bool icase) +{ + return -1; +} + +} + +#endif + diff --git a/performance/time_localised_boost.cpp b/performance/time_localised_boost.cpp new file mode 100644 index 00000000..d1aeac89 --- /dev/null +++ b/performance/time_localised_boost.cpp @@ -0,0 +1,98 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + +#include "regex_comparison.hpp" +#include +#include + +namespace bl{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + boost::reg_expression > e(re, (icase ? boost::regex::perl | boost::regex::icase : boost::regex::perl)); + boost::smatch what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_match(text, what, e); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_match(text, what, e); + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +bool dummy_grep_proc(const boost::smatch&) +{ return true; } + +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + boost::reg_expression > e(re, (icase ? 
boost::regex::perl | boost::regex::icase : boost::regex::perl)); + boost::smatch what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_grep(&dummy_grep_proc, text, e); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + if(result >10) + return result / iter; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + boost::regex_grep(&dummy_grep_proc, text, e); + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +} diff --git a/performance/time_pcre.cpp b/performance/time_pcre.cpp new file mode 100644 index 00000000..5956b521 --- /dev/null +++ b/performance/time_pcre.cpp @@ -0,0 +1,180 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + +#include +#include +#include "regex_comparison.hpp" +#ifdef BOOST_HAS_PCRE +#include "pcre.h" +#include + +namespace pcr{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + pcre *ppcre; + const char *error; + int erroffset; + + int what[50]; + + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + + if(0 == (ppcre = pcre_compile(re.c_str(), (icase ? 
PCRE_CASELESS | PCRE_ANCHORED | PCRE_DOTALL | PCRE_MULTILINE : PCRE_ANCHORED | PCRE_DOTALL | PCRE_MULTILINE), + &error, &erroffset, NULL))) + { + free(ppcre); + return -1; + } + + pcre_extra *pe; + pe = pcre_study(ppcre, 0, &error); + if(error) + { + free(ppcre); + free(pe); + return -1; + } + + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + erroffset = pcre_exec(ppcre, pe, text.c_str(), text.size(), 0, 0, what, sizeof(what)/sizeof(int)); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + erroffset = pcre_exec(ppcre, pe, text.c_str(), text.size(), 0, 0, what, sizeof(what)/sizeof(int)); + } + run = tim.elapsed(); + result = std::min(run, result); + } + free(ppcre); + free(pe); + return result / iter; +} + +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + pcre *ppcre; + const char *error; + int erroffset; + + int what[50]; + + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + int exec_result; + int matches; + + if(0 == (ppcre = pcre_compile(re.c_str(), (icase ? 
PCRE_CASELESS | PCRE_DOTALL | PCRE_MULTILINE : PCRE_DOTALL | PCRE_MULTILINE), &error, &erroffset, NULL))) + { + free(ppcre); + return -1; + } + + pcre_extra *pe; + pe = pcre_study(ppcre, 0, &error); + if(error) + { + free(ppcre); + free(pe); + return -1; + } + + do + { + int startoff; + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + matches = 0; + startoff = 0; + exec_result = pcre_exec(ppcre, pe, text.c_str(), text.size(), startoff, 0, what, sizeof(what)/sizeof(int)); + while(exec_result >= 0) + { + ++matches; + startoff = what[1]; + exec_result = pcre_exec(ppcre, pe, text.c_str(), text.size(), startoff, 0, what, sizeof(what)/sizeof(int)); + } + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + if(result >10) + return result / iter; + + result = DBL_MAX; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + int startoff; + matches = 0; + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + matches = 0; + startoff = 0; + exec_result = pcre_exec(ppcre, pe, text.c_str(), text.size(), startoff, 0, what, sizeof(what)/sizeof(int)); + while(exec_result >= 0) + { + ++matches; + startoff = what[1]; + exec_result = pcre_exec(ppcre, pe, text.c_str(), text.size(), startoff, 0, what, sizeof(what)/sizeof(int)); + } + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +} +#else + +namespace pcr{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + return -1; +} +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + return -1; +} + +} + +#endif \ No newline at end of file diff --git a/performance/time_posix.cpp b/performance/time_posix.cpp new file mode 100644 index 00000000..cd2cec68 --- /dev/null +++ b/performance/time_posix.cpp @@ -0,0 +1,143 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * Permission to use, copy, modify, 
distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + +#include +#include +#include "regex_comparison.hpp" +#ifdef BOOST_HAS_POSIX +#include +#include "regex.h" + +namespace posix{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + regex_t e; + regmatch_t what[20]; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + if(0 != regcomp(&e, re.c_str(), (icase ? REG_ICASE | REG_EXTENDED : REG_EXTENDED))) + return -1; + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + regexec(&e, text.c_str(), e.re_nsub, what, 0); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + regexec(&e, text.c_str(), e.re_nsub, what, 0); + } + run = tim.elapsed(); + result = std::min(run, result); + } + regfree(&e); + return result / iter; +} + +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + regex_t e; + regmatch_t what[20]; + memset(what, 0, sizeof(what)); + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + int exec_result; + int matches; + if(0 != regcomp(&e, re.c_str(), (icase ? 
REG_ICASE | REG_EXTENDED : REG_EXTENDED))) + return -1; + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + what[0].rm_so = 0; + what[0].rm_eo = text.size(); + matches = 0; + exec_result = regexec(&e, text.c_str(), 20, what, REG_STARTEND); + while(exec_result == 0) + { + ++matches; + what[0].rm_so = what[0].rm_eo; + what[0].rm_eo = text.size(); + exec_result = regexec(&e, text.c_str(), 20, what, REG_STARTEND); + } + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + if(result >10) + return result / iter; + + result = DBL_MAX; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + what[0].rm_so = 0; + what[0].rm_eo = text.size(); + matches = 0; + exec_result = regexec(&e, text.c_str(), 20, what, REG_STARTEND); + while(exec_result == 0) + { + ++matches; + what[0].rm_so = what[0].rm_eo; + what[0].rm_eo = text.size(); + exec_result = regexec(&e, text.c_str(), 20, what, REG_STARTEND); + } + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +} +#else + +namespace posix{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + return -1; +} +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + return -1; +} + +} +#endif \ No newline at end of file diff --git a/performance/time_safe_greta.cpp b/performance/time_safe_greta.cpp new file mode 100644 index 00000000..6c600bda --- /dev/null +++ b/performance/time_safe_greta.cpp @@ -0,0 +1,127 @@ +/* + * + * Copyright (c) 2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in 
supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + +#include "regex_comparison.hpp" +#if defined(BOOST_HAS_GRETA) + +#include +#include +#include "regexpr2.h" + +namespace gs{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + regex::rpattern e(re, (icase ? regex::MULTILINE | regex::NORMALIZE | regex::NOCASE : regex::MULTILINE | regex::NORMALIZE), regex::MODE_SAFE); + regex::match_results what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + assert(e.match(text, what)); + do + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text, what); + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text, what); + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + regex::rpattern e(re, (icase ? 
regex::MULTILINE | regex::NORMALIZE | regex::NOCASE : regex::MULTILINE | regex::NORMALIZE), regex::MODE_SAFE); + regex::match_results what; + boost::timer tim; + int iter = 1; + int counter, repeats; + double result = 0; + double run; + do + { + bool r; + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text.begin(), text.end(), what); + while(what.backref(0).matched) + { + e.match(what.backref(0).end(), text.end(), what); + } + } + result = tim.elapsed(); + iter *= 2; + }while(result < 0.5); + iter /= 2; + + if(result > 10) + return result / iter; + + // repeat test and report least value for consistency: + for(repeats = 0; repeats < REPEAT_COUNT; ++repeats) + { + tim.restart(); + for(counter = 0; counter < iter; ++counter) + { + e.match(text.begin(), text.end(), what); + while(what.backref(0).matched) + { + e.match(what.backref(0).end(), text.end(), what); + } + } + run = tim.elapsed(); + result = std::min(run, result); + } + return result / iter; +} + +} + +#else + +namespace gs{ + +double time_match(const std::string& re, const std::string& text, bool icase) +{ + return -1; +} + +double time_find_all(const std::string& re, const std::string& text, bool icase) +{ + return -1; +} + +} + +#endif + diff --git a/posix_ref.htm b/posix_ref.htm deleted file mode 100644 index ffe2e677..00000000 --- a/posix_ref.htm +++ /dev/null @@ -1,314 +0,0 @@ - - - - - - -Regex++, POSIX API Reference - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, POSIX API - Reference.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

POSIX compatibility library

- -
#include <boost/cregex.hpp>
-or:
-#include <boost/regex.h>
- -

The following functions are available for users who need a -POSIX compatible C library. They are provided in both Unicode -and narrow character versions; the standard POSIX API names are -macros that expand to one version or the other depending upon -whether UNICODE is defined.

- -

Important: Note that all the symbols defined here are -enclosed inside namespace boost when used in C++ programs, -unless you use #include <boost/regex.h> instead - in which -case the symbols are still defined in namespace boost, but are -made available in the global namespace as well.

- -

The functions are defined as:

- -
extern "C" {
-int regcompA(regex_tA*, const char*, int);
-unsigned int regerrorA(int, const regex_tA*, char*, unsigned int);
-int regexecA(const regex_tA*, const char*, unsigned int, regmatch_t*, int);
-void regfreeA(regex_tA*);
-
-int regcompW(regex_tW*, const wchar_t*, int);
-unsigned int regerrorW(int, const regex_tW*, wchar_t*, unsigned int);
-int regexecW(const regex_tW*, const wchar_t*, unsigned int, regmatch_t*, int);
-void regfreeW(regex_tW*);
-
-#ifdef UNICODE
-#define regcomp regcompW
-#define regerror regerrorW
-#define regexec regexecW
-#define regfree regfreeW
-#define regex_t regex_tW
-#else
-#define regcomp regcompA
-#define regerror regerrorA
-#define regexec regexecA
-#define regfree regfreeA
-#define regex_t regex_tA
-#endif
-}
- -

All the functions operate on structure regex_t, which -exposes two public members:

- -

unsigned int re_nsub this is filled in by regcomp -and indicates the number of sub-expressions contained in the -regular expression.

- -

const TCHAR* re_endp points to the end of the -expression to compile when the flag REG_PEND is set.

- -

Footnote: regex_t is actually a #define: it is either -regex_tA or regex_tW depending upon whether UNICODE is defined. -Likewise, TCHAR is either char or wchar_t, again depending upon the -macro UNICODE.

- -

regcomp takes a pointer to a regex_t, a pointer -to the expression to compile and a flags parameter which can be a -combination of:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 REG_EXTENDED: Compiles modern regular - expressions. Equivalent to regbase::char_classes | - regbase::intervals | regbase::bk_refs. 
 REG_BASIC: Compiles basic (obsolete) - regular expression syntax. Equivalent to regbase::char_classes - | regbase::intervals | regbase::limited_ops | regbase::bk_braces - | regbase::bk_parens | regbase::bk_refs. 
 REG_NOSPEC: All characters are ordinary, - the expression is a literal string. 
 REG_ICASE: Compiles for matching that - ignores character case. 
 REG_NOSUB: Has no effect in this - library. 
 REG_NEWLINE: When this flag is set a dot - does not match the newline character. 
 REG_PEND: When this flag is set the - re_endp parameter of the regex_t structure must point to - the end of the regular expression to compile. 
 REG_NOCOLLATE: When this flag is set then - locale dependent collation for character ranges is turned - off. 
 REG_ESCAPE_IN_LISTS: When this flag is set, then - escape sequences are permitted in bracket expressions (character - sets). 
 REG_NEWLINE_ALT: When this flag is set then - the newline character is equivalent to the alternation - operator |. 
 REG_PERL: A shortcut for perl-like - behavior: REG_EXTENDED | REG_NOCOLLATE | - REG_ESCAPE_IN_LISTS 
 REG_AWK: A shortcut for awk-like - behavior: REG_EXTENDED | REG_ESCAPE_IN_LISTS 
 REG_GREP: A shortcut for grep like - behavior: REG_BASIC | REG_NEWLINE_ALT 
 REG_EGREP: A shortcut for egrep - like behavior: REG_EXTENDED | REG_NEWLINE_ALT 
- -


- -

regerror takes the following parameters, it maps an -error code to a human readable string:

- - - - - - - - - - - - - - - - - - - - - - - - - - -
 int code: The error code. 
 const regex_t* e: The regular expression (can - be null). 
 char* buf: The buffer to fill in with - the error message. 
 unsigned int buf_size: The length of buf. 
- -

If the error code is OR'ed with REG_ITOA then the message that -results is the printable name of the code rather than a message, -for example "REG_BADPAT". If the code is REG_ATOI then e -must not be null and e->re_endp must point to the -printable name of an error code; the return value is then the -value of the error code. For any other value of code, the -return value is the number of characters in the error message; if -the return value is greater than or equal to buf_size then -regerror will have to be called again with a larger buffer.
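The buffer-resizing step described above can be sketched as follows. This is a minimal illustration that assumes the system's POSIX `<regex.h>`, which exposes the same standard names; with this library the header named earlier in this page would be included instead. The helper names `error_message` and `compile_error` are hypothetical and not part of the library. The sketch grows the buffer only when the reported size exceeds it, which is the plain POSIX rule; the stricter "greater than or equal to" test described above may be needed with other implementations.

```cpp
#include <regex.h>   // assumption: the system POSIX header
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the message for `code`, calling regerror a
// second time with a larger buffer if the first buffer was too small.
std::string error_message(int code, const regex_t* e)
{
    std::vector<char> buf(16);
    // regerror returns the size needed to hold the whole message:
    std::size_t needed = regerror(code, e, buf.data(), buf.size());
    if(needed > buf.size())
    {
        buf.resize(needed);
        regerror(code, e, buf.data(), buf.size());
    }
    return std::string(buf.data());
}

// Hypothetical helper: compiles `pattern` and returns the error text,
// or an empty string when compilation succeeds.
std::string compile_error(const char* pattern)
{
    regex_t e;
    int code = regcomp(&e, pattern, REG_EXTENDED);
    if(code == 0)
    {
        regfree(&e);   // only a successfully compiled expression is freed
        return std::string();
    }
    return error_message(code, &e);
}
```

Calling `compile_error("(abc")` should yield a non-empty message, since the parenthesis is never closed; the exact wording of the message depends on the implementation.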

- -

regexec finds the first occurrence of expression e -within string buf. If len is non-zero then *m -is filled in with what matched the regular expression: m[0] -contains what matched the whole expression, m[1] the first sub-expression, -and so on; see regmatch_t in the header file declaration for -more details. The eflags parameter can be a combination of: -

- - - - - - - - - - - - - - - - - - - - -
 REG_NOTBOL: Parameter buf does - not represent the start of a line. 
 REG_NOTEOL: Parameter buf does - not terminate at the end of a line. 
 REG_STARTEND: The string searched starts - at buf + m[0].rm_so and ends at buf + m[0].rm_eo. 
- -


- -

Finally regfree frees all the memory that was allocated -by regcomp.
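Taken together, the four functions follow a compile, execute, free cycle. The sketch below is one possible illustration of that cycle, assuming the system's POSIX `<regex.h>` (which provides the same standard names) and a hypothetical helper called `first_match`; it is not taken from the library itself.

```cpp
#include <regex.h>   // assumption: the system POSIX header
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the whole match followed by each
// sub-expression for the first occurrence of `pattern` in `text`.
// The result is empty when nothing matches or the pattern does not compile.
std::vector<std::string> first_match(const char* pattern, const std::string& text)
{
    std::vector<std::string> out;
    regex_t e;
    if(regcomp(&e, pattern, REG_EXTENDED) != 0)
        return out;

    // re_nsub sub-expressions, plus one slot for the whole match:
    std::vector<regmatch_t> m(e.re_nsub + 1);
    if(regexec(&e, text.c_str(), m.size(), m.data(), 0) == 0)
    {
        for(std::size_t i = 0; i < m.size(); ++i)
        {
            if(m[i].rm_so < 0)
                out.push_back(std::string());   // sub-expression did not take part
            else
                out.push_back(text.substr(m[i].rm_so, m[i].rm_eo - m[i].rm_so));
        }
    }
    regfree(&e);   // release everything regcomp allocated
    return out;
}
```

For instance, `first_match("([0-9]+)-([a-z]+)", "id 100-abc!")` returns three strings: the whole match, then the digits, then the letters.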

- -

Footnote: this is an abridged reference to the POSIX API -functions, it is provided for compatibility with other libraries, -rather than an API to be used in new code (unless you need access -from a language other than C++). This version of these functions -should also happily coexist with other versions, as the names -used are macros that expand to the actual function names.
-

- -
- -

Copyright Dr -John Maddock 1998-2000 all rights reserved.

- - diff --git a/syntax.htm b/syntax.htm deleted file mode 100644 index 327071e5..00000000 --- a/syntax.htm +++ /dev/null @@ -1,742 +0,0 @@ - - - - - - -Regex++, Regular Expression Syntax - - - - -

 

- - - - - - -

C++ Boost

-

Regex++, Regular - Expression Syntax.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

Regular expression syntax

- -

 This section covers the regular expression syntax used by this -library. This is a programmer's guide; the actual syntax presented -to your program's users will depend upon the flags used during -expression compilation.

- -

Literals

- -

All characters are literals except: ".", "|", -"*", "?", "+", "(", -")", "{", "}", "[", -"]", "^", "$" and "\". -These characters are literals when preceded by a "\". A -literal is a character that matches itself, or matches the result -of traits_type::translate(), where traits_type is the traits -template parameter to class reg_expression.

- -

Wildcard

- -

 The dot character "." matches any single character -except in two cases: when match_not_dot_null is passed to the matching -algorithms, the dot does not match a null character; when match_not_dot_newline -is passed to the matching algorithms, then the dot does not match -a newline character.

- -

Repeats

- -

A repeat is an expression that is repeated an arbitrary number -of times. An expression followed by "*" can be repeated -any number of times including zero. An expression followed by -"+" can be repeated any number of times, but at least -once, if the expression is compiled with the flag regbase::bk_plus_qm -then "+" is an ordinary character and "\+" -represents a repeat of once or more. An expression followed by -"?" may be repeated zero or one times only, if the -expression is compiled with the flag regbase::bk_plus_qm then -"?" is an ordinary character and "\?" -represents the repeat zero or once operator. When it is necessary -to specify the minimum and maximum number of repeats explicitly, -the bounds operator "{}" may be used, thus "a{2}" -is the letter "a" repeated exactly twice, "a{2,4}" -represents the letter "a" repeated between 2 and 4 -times, and "a{2,}" represents the letter "a" -repeated at least twice with no upper limit. Note that there must -be no white-space inside the {}, and there is no upper limit on -the values of the lower and upper bounds. When the expression is -compiled with the flag regbase::bk_braces then "{" and -"}" are ordinary characters and "\{" and -"\}" are used to delimit bounds instead. All repeat -expressions refer to the shortest possible previous sub-expression: -a single character; a character set, or a sub-expression grouped -with "()" for example.

- -

Examples:

- -

"ba*" will match all of "b", "ba", -"baaa" etc.

- -

"ba+" will match "ba" or "baaaa" -for example but not "b".

- -

"ba?" will match "b" or "ba".

- -

"ba{2,4}" will match "baa", "baaa" -and "baaaa".

- -

Non-greedy repeats

- -

Whenever the "extended" regular expression syntax is -in use (the default) then non-greedy repeats are possible by -appending a '?' after the repeat; a non-greedy repeat is one -which will match the shortest possible string.

- -

For example to match html tag pairs one could use something -like:

- -

"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>" -

- -

In this case $1 will contain the text between the tag pairs, -and will be the shortest possible matching string.
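The difference between the greedy and the non-greedy repeat in the tag example can be sketched with the C++ standard library. This is an illustration under stated assumptions: std::regex (default ECMAScript grammar) is swapped in for this library's own classes, the helper name text_between_tags is hypothetical, and the tag name is assumed to contain no regular expression operators.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns sub-expression 1, the text between the first
// <tag ...> and a following </tag>. `lazy` selects (.*?) instead of (.*).
std::string text_between_tags(const std::string& input,
                              const std::string& tag,
                              bool lazy)
{
    const std::string repeat = lazy ? "(.*?)" : "(.*)";
    const std::regex re("<\\s*" + tag + "[^>]*>" + repeat +
                        "<\\s*/" + tag + "\\s*>");
    std::smatch m;
    if (!std::regex_search(input, m, re)) {
        return std::string();
    }
    return m[1].str();
}
```

For the input "<b>one</b> and <b>two</b>", the non-greedy form stops at the first closing tag, while the greedy form runs on to the last one.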

- -

Parenthesis

- -

Parentheses serve two purposes, to group items together into a -sub-expression, and to mark what generated the match. For example -the expression "(ab)*" would match all of the string -"ababab". The matching algorithms regex_match and regex_search each -take an instance of match_results -that reports what caused the match, on exit from these functions -the match_results -contains information both on what the whole expression matched -and on what each sub-expression matched. In the example above -match_results[1] would contain a pair of iterators denoting the -final "ab" of the matching string. It is permissible -for sub-expressions to match null strings. If a sub-expression -takes no part in a match - for example if it is part of an -alternative that is not taken - then both of the iterators that -are returned for that sub-expression point to the end of the -input string, and the matched parameter for that sub-expression -is false. Sub-expressions are indexed from left to right -starting from 1, sub-expression 0 is the whole expression.
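The two behaviours described above (a repeated sub-expression reports its final repetition, and a sub-expression that takes no part in the match is flagged as unmatched) can be sketched as follows. std::regex and std::smatch from the C++ standard library are used here as stand-ins for reg_expression and match_results; the helper names are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Offset at which sub-expression 1 of "(ab)*" starts once `input` has been
// matched in full, or -1 if the input does not match.
long last_repeat_offset(const std::string& input)
{
    const std::regex re("(ab)*");
    std::smatch m;
    if (!std::regex_match(input, m, re)) {
        return -1;
    }
    return static_cast<long>(m.position(1));
}

// True if sub-expression `index` of "(a)|(b)" took part in matching `input`.
bool took_part(const std::string& input, std::size_t index)
{
    const std::regex re("(a)|(b)");
    std::smatch m;
    return std::regex_match(input, m, re) && m[index].matched;
}
```

With the input "ababab", sub-expression 1 starts at offset 4, the final "ab". With the input "b", only the second alternative's sub-expression reports matched as true.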

- -

Non-Marking Parenthesis

- -

 Sometimes you need to group sub-expressions with parentheses, -but don't want the parentheses to generate another marked sub-expression; -in this case a non-marking parenthesis (?:expression) can be used. -For example the following expression creates no sub-expressions:

- -

"(?:abc)*"

- -

Forward Lookahead Asserts 

- -

There are two forms of these; one for positive forward -lookahead asserts, and one for negative lookahead asserts:

- -

"(?=abc)" matches zero characters only if they are -followed by the expression "abc".

- -

"(?!abc)" matches zero characters only if they are -not followed by the expression "abc".
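Both forms of lookahead can be tried out with a short sketch. It assumes std::regex from the C++ standard library (ECMAScript grammar) in place of this library's classes, and the helper name all_matches is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every non-overlapping match of `pattern`
// in `text`, in order of occurrence.
std::vector<std::string> all_matches(const std::string& pattern,
                                     const std::string& text)
{
    const std::regex re(pattern);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

For example, "\\w+(?=\\.txt)" applied to "a.txt b.doc c.txt" yields the names "a" and "c" without consuming the extension, and "q(?!u)" applied to "quit iraq qi" finds the two occurrences of "q" that are not followed by "u".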

- -

Alternatives

- -

Alternatives occur when the expression can match either one -sub-expression or another, each alternative is separated by a -"|", or a "\|" if the flag regbase::bk_vbar -is set, or by a newline character if the flag regbase::newline_alt -is set. Each alternative is the largest possible previous sub-expression; -this is the opposite behaviour from repetition operators.

- -

Examples:

- -

"a(b|c)" could match "ab" or "ac". -

- -

"abc|def" could match "abc" or "def". -

- -

Sets

- -

 A set is a set of characters that can match any single -character that is a member of the set. Sets are delimited by -"[" and "]" and can contain literals, -character ranges, character classes, collating elements and -equivalence classes. Set declarations that start with "^" -contain the complement of the elements that follow.

- -

Examples:

- -

Character literals:

- -

"[abc]" will match either of "a", "b", -or "c".

- -

 "[^abc]" will match any character other than "a", -"b", or "c".

- -

Character ranges:

- -

"[a-z]" will match any character in the range "a" -to "z".

- -

"[^A-Z]" will match any character other than those -in the range "A" to "Z".

- -

Note that character ranges are highly locale dependent: they -match any character that collates between the endpoints of the -range, ranges will only behave according to ASCII rules when the -default "C" locale is in effect. For example if the -library is compiled with the Win32 localization model, then [a-z] -will match the ASCII characters a-z, and also 'A', 'B' etc, but -not 'Z' which collates just after 'z'. This locale specific -behaviour can be disabled by specifying regbase::nocollate when -compiling, this is the default behaviour when using regbase::normal, -and forces ranges to collate according to ASCII character code. -Likewise, if you use the POSIX C API functions then setting -REG_NOCOLLATE turns off locale dependent collation.

- -

Character classes are denoted using the syntax "[:classname:]" -within a set declaration, for example "[[:space:]]" is -the set of all whitespace characters. Character classes are only -available if the flag regbase::char_classes is set. The available -character classes are:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 alnum  Any alphanumeric character. 
 alpha  Any alphabetical character a-z - and A-Z. Other characters may also be included depending - upon the locale. 
 blank  Any blank character, either - a space or a tab. 
 cntrl  Any control character. 
 digit  Any digit 0-9. 
 graph  Any graphical character. 
 lower  Any lower case character a-z. - Other characters may also be included depending upon the - locale. 
 print  Any printable character. 
 punct  Any punctuation character. 
 space  Any whitespace character. 
 upper  Any upper case character A-Z. - Other characters may also be included depending upon the - locale. 
 xdigit  Any hexadecimal digit - character, 0-9, a-f and A-F. 
 word  Any word character - all - alphanumeric characters plus the underscore. 
 unicode  Any character whose code is - greater than 255, this applies to the wide character - traits classes only. 
- -

There are some shortcuts that can be used in place of the -character classes, provided the flag regbase::escape_in_lists is -set then you can use:

- -

\w in place of [:word:]

- -

\s in place of [:space:]

- -

\d in place of [:digit:]

- -

\l in place of [:lower:]

- -

\u in place of [:upper:]
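Sets, negated sets, character classes and the \d shortcut can be compared with a small counting sketch. It assumes std::regex from the C++ standard library as a stand-in for this library (std::regex accepts \w, \s and \d but not \l or \u), and the helper name count_matches is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: number of non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

With the text "a1 b22 c333", both "[[:digit:]]+" and "\\d+" find three runs of digits; with the text "abC1", the negated range "[^a-z]" matches the two characters "C" and "1".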

- -

Collating elements take the general form [.tagname.] inside a -set declaration, where tagname is either a single -character, or a name of a collating element, for example [[.a.]] -is equivalent to [a], and [[.comma.]] is equivalent to [,]. The -library supports all the standard POSIX collating element names, -and in addition the following digraphs: "ae", "ch", -"ll", "ss", "nj", "dz", -"lj", each in lower, upper and title case variations. -Multi-character collating elements can result in the set matching -more than one character, for example [[.ae.]] would match two -characters, but note that [^[.ae.]] would only match one -character.

- -

 Equivalence classes take the general form [=tagname=] inside a -set declaration, where tagname is either a single -character, or a name of a collating element, and matches any -character that is a member of the same primary equivalence class -as the collating element [.tagname.]. An equivalence class is a -set of characters that collate the same; a primary equivalence -class is a set of characters whose primary sort keys are all the -same (for example strings are typically collated by character, -then by accent, and then by case; the primary sort key then -relates to the character, the secondary to the accentation, and -the tertiary to the case). If there is no equivalence class -corresponding to tagname, then [=tagname=] is exactly the -same as [.tagname.]. Unfortunately there is no locale independent -method of obtaining the primary sort key for a character, except -under Win32. For other operating systems the library will "guess" -the primary sort key from the full sort key (obtained from strxfrm), -so equivalence classes are probably best considered broken under -any operating system other than Win32.

- -

 To include a literal "-" in a set declaration: -make it the first character after the opening "[" or -"[^", the endpoint of a range, a collating element, or -if the flag regbase::escape_in_lists is set then precede with an -escape character as in "[\-]". To include a literal -"[" or "]" or "^" in a set then -make them the endpoint of a range, a collating element, or -precede with an escape character if the flag regbase::escape_in_lists -is set.

- -

Line anchors

- -

An anchor is something that matches the null string at the -start or end of a line: "^" matches the null string at -the start of a line, "$" matches the null string at the -end of a line.

- -

Back references

- -

A back reference is a reference to a previous sub-expression -that has already been matched, the reference is to what the sub-expression -matched, not to the expression itself. A back reference consists -of the escape character "\" followed by a digit "1" -to "9", "\1" refers to the first sub-expression, -"\2" to the second etc. For example the expression -"(.*)\1" matches any string that is repeated about its -mid-point for example "abcabc" or "xyzxyz". A -back reference to a sub-expression that did not participate in -any match, matches the null string: NB this is different to some -other regular expression matchers. Back references are only -available if the expression is compiled with the flag regbase::bk_refs -set.
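The "(.*)\1" example can be checked with a short sketch. It assumes std::regex from the C++ standard library in place of this library's classes (back references are always enabled in its default grammar, so no flag is needed), and the helper name is_doubled is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `input` is some string written twice in a row.
// "\\1" must repeat the text that sub-expression 1 actually matched.
bool is_doubled(const std::string& input)
{
    const std::regex re("(.*)\\1");
    return std::regex_match(input, re);
}
```

The strings "abcabc" and "xyzxyz" are accepted, while "abcabd" is rejected because the second half differs from what the sub-expression matched.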

- -

Characters by code

- -

 This is an extension to the algorithm that is not available in -other libraries; it consists of the escape character followed by -the digit "0" followed by the octal character code. For -example "\023" represents the character whose octal -code is 23. Where ambiguity could occur use parentheses to break -the expression up: "\0103" represents the character -whose octal code is 103, "(\010)3" represents the character whose octal code is 10 -followed by "3". To match characters by their -hexadecimal code, use \x followed by a string of hexadecimal -digits, optionally enclosed inside {}, for example \xf0 or -\x{aff}; notice the latter example is a Unicode character.

- -

Word operators

- -

The following operators are provided for compatibility with -the GNU regular expression library.

- -

"\w" matches any single character that is a member -of the "word" character class, this is identical to the -expression "[[:word:]]".

- -

"\W" matches any single character that is not a -member of the "word" character class, this is identical -to the expression "[^[:word:]]".

- -

"\<" matches the null string at the start of a -word.

- -

"\>" matches the null string at the end of the -word.

- -

"\b" matches the null string at either the start or -the end of a word.

- -

"\B" matches a null string within a word.

- -

The start of the sequence passed to the matching algorithms is -considered to be a potential start of a word unless the flag -match_not_bow is set. The end of the sequence passed to the -matching algorithms is considered to be a potential end of a word -unless the flag match_not_eow is set.
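The word-boundary operators can be illustrated with a whole-word search. This sketch assumes std::regex from the C++ standard library as a stand-in; it supports "\b" and "\B" but not the GNU-style "\<" and "\>". The helper name contains_word is hypothetical, and the word is assumed to contain no regular expression operators.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `word` occurs in `text` as a whole word.
// The start and end of `text` count as potential word boundaries.
bool contains_word(const std::string& text, const std::string& word)
{
    const std::regex re("\\b" + word + "\\b");
    return std::regex_search(text, re);
}
```

The word "cat" is found in "the cat sat" and in the single-word text "cat", but not in "concatenate", where it is surrounded by other word characters.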

- -

Buffer operators

- -

 The following operators are provided for compatibility with the -GNU regular expression library, and Perl regular expressions:

- -

"\`" matches the start of a buffer.

- -

"\A" matches the start of the buffer.

- -

"\'" matches the end of a buffer.

- -

"\z" matches the end of a buffer.

- -

"\Z" matches the end of a buffer, or possibly one or -more new line characters followed by the end of the buffer.

- -

A buffer is considered to consist of the whole sequence passed -to the matching algorithms, unless the flags match_not_bob or -match_not_eob are set.

- -

Escape operator

- -

The escape character "\" has several meanings.

- -

Inside a set declaration the escape character is a normal -character unless the flag regbase::escape_in_lists is set in -which case whatever follows the escape is a literal character -regardless of its normal meaning.

- -

The escape operator may introduce an operator for example: -back references, or a word operator.

- -

The escape operator may make the following character normal, -for example "\*" represents a literal "*" -rather than the repeat operator.

- -

Single character escape sequences

- -

The following escape sequences are aliases for single -characters:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 Escape sequence Character code Meaning  
 \a 0x07 Bell character.  
 \f 0x0C Form feed.  
 \n 0x0A Newline character.  
 \r 0x0D Carriage return.  
 \t 0x09 Tab character.  
 \v 0x0B Vertical tab.  
 \e 0x1B ASCII Escape character.  
 \0dd 0dd An octal character code, - where dd is one or more octal digits.  
 \xXX 0xXX A hexadecimal character - code, where XX is one or more hexadecimal digits.  
 \x{XX} 0xXX A hexadecimal character - code, where XX is one or more hexadecimal digits, - optionally a unicode character.  
 \cZ z-@ An ASCII escape sequence - control-Z, where Z is any ASCII character greater than or - equal to the character code for '@'.  
- -


- -

Miscellaneous escape sequences:

- -

The following are provided mostly for perl compatibility, but -note that there are some differences in the meanings of \l \L \u -and \U:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 \w Equivalent to [[:word:]].  
 \W Equivalent to [^[:word:]].  
 \s Equivalent to [[:space:]].  
 \S Equivalent to [^[:space:]].  
 \d Equivalent to [[:digit:]].  
 \D Equivalent to [^[:digit:]].  
 \l Equivalent to [[:lower:]].  
 \L Equivalent to [^[:lower:]].  
 \u Equivalent to [[:upper:]].  
 \U Equivalent to [^[:upper:]].  
 \C Any single character, - equivalent to '.'.  
 \X Match any Unicode combining - character sequence, for example "a\x 0301" (a - letter a with an acute).  
 \Q The begin quote operator, - everything that follows is treated as a literal character - until a \E end quote operator is found.  
 \E The end quote operator, - terminates a sequence begun with \Q.  
- -


- -

What gets matched?

- -

 The regular expression library will match the first possible -matching string; if more than one string starting at a given -location can match then it matches the longest possible string, -unless the flag match_any is set, in which case the first match -encountered is returned. Use of the match_any option can reduce -the time taken to find the match - but is only useful if the user -is less concerned about what matched - for example it would not -be suitable for search and replace operations. In cases where -there are multiple possible matches all starting at the same -location, and all of the same length, then the match chosen is -the one with the longest first sub-expression; if that is the -same for two or more matches, then the second sub-expression will -be examined and so on.
-

- -
- -

Copyright Dr -John Maddock 1998-2000 all rights reserved.

- - diff --git a/template_class_ref.htm b/template_class_ref.htm deleted file mode 100644 index ccd0d3c9..00000000 --- a/template_class_ref.htm +++ /dev/null @@ -1,2479 +0,0 @@ - - - - - - -Regex++, template class and algorithm reference - - - - -

 

- - - - - - -

C++ Boost

-

Regex++ template - class reference.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

class regbase

- -

#include <boost/regex.hpp> -

- -

Class regbase is the template argument independent base class -for reg_expression, the only public members are the flag_type -enumerated values that determine how regular expressions are -interpreted.

- -
class regbase
-{
-public:
-   enum flag_type_
-   {
-      escape_in_lists = 1,                          // '\\' special inside [...] 
-      char_classes = escape_in_lists << 1,          // [[:CLASS:]] allowed 
-      intervals = char_classes << 1,                // {x,y} allowed 
-      limited_ops = intervals << 1,                 // all of + ? and | are normal characters 
-      newline_alt = limited_ops << 1,               // \n is the same as | 
-      bk_plus_qm = newline_alt << 1,                // uses \+ and \? 
-      bk_braces = bk_plus_qm << 1,                  // uses \{ and \} 
-      bk_parens = bk_braces << 1,                   // uses \( and \) 
-      bk_refs = bk_parens << 1,                     // back references \1 to \9 allowed 
-      bk_vbar = bk_refs << 1,                       // uses \| 
-      use_except = bk_vbar << 1,                    // exception on error 
-      failbit = use_except << 1,                    // error flag 
-      literal = failbit << 1,                       // all characters are literals 
-      icase = literal << 1,                         // characters are matched regardless of case 
-      nocollate = icase << 1,                       // don't use locale specific collation 
-
-      basic = char_classes | intervals | limited_ops | bk_braces | bk_parens | bk_refs,
-      extended = char_classes | intervals | bk_refs,
-      normal = escape_in_lists | char_classes | intervals | bk_refs | nocollate,
-      emacs = bk_braces | bk_parens | bk_refs | bk_vbar,
-      awk = extended | escape_in_lists,
-      grep = basic | newline_alt,
-      egrep = extended | newline_alt,
-      sed = basic,
-      perl = normal
-   }; 
-   typedef unsigned int flag_type;
-};   
- -

 

- -

The enumerated type regbase::flag_type determines the -syntax rules for regular expression compilation, the various -flags have the following effects:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 regbase::escape_in_lists  Allows the use of the escape - "\" character in sets of characters, for - example [\]] represents the set of characters containing - only "]". If this flag is not set then "\" - is an ordinary character inside sets. 
 regbase::char_classes  When this bit is set, - character classes [:classname:] are allowed inside - character set declarations, for example "[[:word:]]" - represents the set of all characters that belong to the - character class "word". 
 regbase::intervals  When this bit is set, - repetition intervals are allowed, for example "a{2,4}" - represents a repeat of between 2 and 4 letter a's. 
 regbase::limited_ops  When this bit is set all of - "+", "?" and "|" are - ordinary characters in all situations. 
 regbase::newline_alt  When this bit is set, then - the newline character "\n" has the same effect - as the alternation operator "|". 
 regbase::bk_plus_qm  When this bit is set then - "\+" represents the one or more repetition - operator and "\?" represents the zero or one - repetition operator. When this bit is not set then - "+" and "?" are used instead. 
 regbase::bk_braces  When this bit is set then - "\{" and "\}" are used for bounded - repetitions and "{" and "}" are - normal characters. This is the opposite of default - behaviour. 
 regbase::bk_parens  When this bit is set then - "\(" and "\)" are used to group sub-expressions - and "(" and ")" are ordinary - characters. This is the opposite of default behaviour. 
 regbase::bk_refs  When this bit is set then - back references are allowed. 
 regbase::bk_vbar  When this bit is set then - "\|" represents the alternation operator and - "|" is an ordinary character. This is the - opposite of default behaviour. 
 regbase::use_except  When this bit is set then a bad_expression exception will - be thrown on error.  Use of this flag is deprecated - - reg_expression will always throw on error. 
 regbase::failbit  This bit is set on error; if - regbase::use_except is not set, then this bit should be - checked to see if a regular expression is valid before - usage. 
 regbase::literal  All characters in the string - are treated as literals, there are no special characters - or escape sequences. 
 regbase::icase  All characters in the string - are matched regardless of case. 
 regbase::nocollate  Locale specific collation is - disabled when dealing with ranges in character set - declarations.  For example when this bit is set the - expression [a-c] would match the characters a, b and c - only regardless of locale, whereas when this is not set, - then [a-c] matches any character which collates in the - range a to c. 
 regbase::basic  Equivalent to the POSIX - basic regular expression syntax: char_classes | intervals - | limited_ops | bk_braces | bk_parens | bk_refs. 
 regbase::extended  Equivalent to the POSIX - extended regular expression syntax: char_classes | - intervals | bk_refs. 
 regbase::normal  This is the - default setting, and represents how most people expect - the library to behave. Equivalent to the POSIX extended - syntax, but with locale specific collation disabled, and - escape characters inside set declarations enabled: - regbase::escape_in_lists | regbase::char_classes | - regbase::intervals | regbase::bk_refs | regbase::nocollate. 
 regbase::emacs  Provides - compatibility with the emacs editor, equivalent to: - bk_braces | bk_parens | bk_refs | bk_vbar. 
 regbase::awk  Provides - compatibility with the Unix utility Awk, the same as POSIX - extended regular expressions, but allows escapes inside - bracket-expressions (character sets). Equivalent to - extended | escape_in_lists. 
 regbase::grep  Provides - compatibility with the Unix grep utility, the same as - POSIX basic regular expressions, but with the newline - character equivalent to the alternation operator. The - same as basic | newline_alt. 
 regbase::egrep  Provides - compatibility with the Unix egrep utility, the same as - POSIX extended regular expressions, but with the newline - character equivalent to the alternation operator. The - same as extended | newline_alt. 
 regbase::sed  Provides - compatibility with the Unix sed utility, the same as POSIX - basic regular expressions. 
 regbase::perl  Provides - compatibility with the perl programming language, the - same as regbase::normal. 
- -
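Because each flag occupies one bit and the named syntaxes are bitwise ORs of those bits, a flag value can be inspected with a simple mask. The sketch below is self-contained and purely illustrative: the namespace sketch, the cut-down enum (a copy of the first nine values and four of the combinations listed above) and the helper has_flag are hypothetical and are not part of the library.

```cpp
namespace sketch {

// Hypothetical, cut-down mirror of regbase::flag_type_ for illustration only.
enum flag_type_ {
    escape_in_lists = 1,
    char_classes    = escape_in_lists << 1,
    intervals       = char_classes << 1,
    limited_ops     = intervals << 1,
    newline_alt     = limited_ops << 1,
    bk_plus_qm      = newline_alt << 1,
    bk_braces       = bk_plus_qm << 1,
    bk_parens       = bk_braces << 1,
    bk_refs         = bk_parens << 1,

    basic    = char_classes | intervals | limited_ops | bk_braces | bk_parens | bk_refs,
    extended = char_classes | intervals | bk_refs,
    awk      = extended | escape_in_lists,
    grep     = basic | newline_alt
};

typedef unsigned int flag_type;

// True if every bit of `bit` is set in `flags`.
inline bool has_flag(flag_type flags, flag_type bit)
{
    return (flags & bit) == bit;
}

} // namespace sketch
```

With these definitions, sketch::awk contains escape_in_lists while sketch::extended does not, which is exactly the difference between the two syntaxes described in the table.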
- -

Exception classes.

- -

#include <boost/pat_except.hpp> -

- -

An instance of bad_expression is thrown whenever a bad -regular expression is encountered.

- -
namespace boost{
-
-class bad_pattern : public std::runtime_error
-{
-public:
-   explicit bad_pattern(const std::string& s) : std::runtime_error(s){};
-};
-
-class bad_expression : public bad_pattern
-{
-public:
-   bad_expression(const std::string& s) : bad_pattern(s) {}
-};
-
-
-} // namespace boost
- -

 Footnotes: the class bad_pattern forms the base class -for all pattern-matching exceptions, of which bad_expression -is one. The choice of std::runtime_error as the base class -for bad_pattern is moot: depending upon how the library is -used, exceptions may be either logic errors (programmer supplied -expressions) or run time errors (user supplied expressions).
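Since the exception hierarchy is rooted in std::runtime_error, a user-supplied expression can be validated by catching that base class. The sketch below assumes std::regex from the C++ standard library as a stand-in; its std::regex_error also derives from std::runtime_error, much as bad_expression does here. The helper name compile_error is hypothetical.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns "" if `pattern` compiles, otherwise the
// message carried by the exception thrown during compilation.
std::string compile_error(const std::string& pattern)
{
    try {
        const std::regex re(pattern);
        (void)re;  // only the success or failure of construction matters here
        return std::string();
    } catch (const std::runtime_error& e) {
        // std::regex_error is caught here through its base class.
        return e.what();
    }
}
```

A well-formed expression such as "a+b" yields an empty string, while an unbalanced one such as "(a" yields a non-empty diagnostic whose wording depends on the implementation.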

- -
- -

Class reg_expression

- -

#include <boost/regex.hpp> -

- -

The template class reg_expression encapsulates regular -expression parsing and compilation. The class derives from class regbase and takes three template -parameters:

- -

charT: determines the character type, i.e. -either char or wchar_t.

- -

traits: determines the behaviour of the -character type, for example whether character matching is case -sensitive or not, and which character class names are recognized. -A default traits class is provided: regex_traits<charT>. -

- -

Allocator: the allocator class used to allocate -memory by the class.

- -

For ease of use there are two typedefs that define the two -standard reg_expression instances, unless you want to use -custom allocators, you won't need to use anything other than -these:

- -
namespace boost{
-template <class charT, class traits = regex_traits<charT>, class Allocator = std::allocator<charT>  >
-class reg_expression;
-typedef reg_expression<char> regex;
-typedef reg_expression<wchar_t> wregex;
-}
- -

The definition of reg_expression follows: it is based -very closely on class basic_string, and fulfils the requirements -for a container of charT.

- -
namespace boost{
-template <class charT, class traits = regex_traits<charT>, class Allocator = std::allocator<charT>  >
-class reg_expression : public regbase
-{
-public: 
-   // typedefs:  
-   typedef charT char_type; 
-   typedef traits traits_type; 
-   // locale_type 
-   // placeholder for actual locale type used by the 
-   // traits class to localise *this. 
-   typedef typename traits::locale_type locale_type; 
-   // value_type 
-   typedef charT value_type; 
-   // reference, const_reference 
-   typedef charT& reference; 
-   typedef const charT& const_reference; 
-   // iterator, const_iterator 
-   typedef const charT* const_iterator; 
-   typedef const_iterator iterator; 
-   // difference_type 
-   typedef typename Allocator::difference_type difference_type; 
-   // size_type 
-   typedef typename Allocator::size_type size_type; 
-   // allocator_type 
-   typedef Allocator allocator_type; 
-   typedef Allocator alloc_type; 
-   // flag_type 
-   typedef boost::int_fast32_t flag_type; 
-public: 
-   // constructors
-   explicit reg_expression(const Allocator& a = Allocator()); 
-   explicit reg_expression(const charT* p, flag_type f = regbase::normal, const Allocator& a = Allocator()); 
-   reg_expression(const charT* p1, const charT* p2, flag_type f = regbase::normal, const Allocator& a = Allocator()); 
-   reg_expression(const charT* p, size_type len, flag_type f, const Allocator& a = Allocator()); 
-   reg_expression(const reg_expression&); 
-   template <class ST, class SA> 
-   explicit reg_expression(const std::basic_string<charT, ST, SA>& p, flag_type f = regbase::normal, const Allocator& a = Allocator()); 
-   template <class I> 
-   reg_expression(I first, I last, flag_type f = regbase::normal, const Allocator& a = Allocator()); 
-   ~reg_expression(); 
-   reg_expression& operator=(const reg_expression&); 
-   reg_expression& operator=(const charT* ptr); 
-   template <class ST, class SA> 
-   reg_expression& operator=(const std::basic_string<charT, ST, SA>& p); 
-   // 
-   // assign: 
-   reg_expression& assign(const reg_expression& that); 
-   reg_expression& assign(const charT* ptr, flag_type f = regbase::normal); 
-   reg_expression& assign(const charT* first, const charT* last, flag_type f = regbase::normal); 
-   template <class string_traits, class A> 
-   reg_expression& assign( 
-       const std::basic_string<charT, string_traits, A>& s, 
-       flag_type f = regbase::normal); 
-   template <class iterator> 
-   reg_expression& assign(iterator first, 
-                          iterator last, 
-                          flag_type f = regbase::normal); 
-   // 
-   // allocator access: 
-   Allocator get_allocator()const; 
-   // 
-   // locale: 
-   locale_type imbue(locale_type l); 
-   locale_type getloc()const; 
-   // 
-   // flags: 
-   flag_type getflags()const; 
-   // 
-   // str: 
-   std::basic_string<charT> str()const; 
-   // 
-   // begin, end: 
-   const_iterator begin()const; 
-   const_iterator end()const; 
-   // 
-   // swap: 
-   void swap(reg_expression&)throw(); 
-   // 
-   // size: 
-   size_type size()const; 
-   // 
-   // max_size: 
-   size_type max_size()const; 
-   // 
-   // empty: 
-   bool empty()const; 
-   unsigned mark_count()const; 
-   bool operator==(const reg_expression&)const; 
-   bool operator<(const reg_expression&)const; 
-};
-} // namespace boost 
- -

Class reg_expression has the following public member functions: -

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 reg_expression(Allocator a = - Allocator()); Constructs a default - instance of reg_expression without any expression. 
 reg_expression(charT* p, unsigned - f = regbase::normal, Allocator a = Allocator()); Constructs an instance - of reg_expression from the expression denoted by the null - terminated string p, using the flags f to - determine regular expression syntax. See class regbase for allowable flag values. 
 reg_expression(charT* p1, charT* p2, unsigned f = regbase::normal, Allocator a = Allocator()); Constructs an instance of reg_expression from the expression denoted by the pair of input-iterators p1 and p2, using the flags f to determine regular expression syntax. See class regbase for allowable flag values. 
 reg_expression(charT* p, - size_type len, unsigned f, Allocator a = Allocator()); Constructs an instance - of reg_expression from the expression denoted by the - string p of length len, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. 
 template <class - ST, class SA>
- reg_expression(const std::basic_string<charT, - ST, SA>& p, boost::int_fast32_t f = regbase::normal, - const Allocator& a = Allocator());
 Constructs an instance - of reg_expression from the expression denoted by the - string p, using the flags f to determine - regular expression syntax. See class regbase - for allowable flag values.

Note - this member may not - be available depending upon your compiler capabilities.

-
 
 template <class I>
- reg_expression(I first, I last, flag_type f = regbase::normal, - const Allocator& a = Allocator());
 Constructs an instance of reg_expression from the expression denoted by the pair of input-iterators first and last, using the flags f to determine regular expression syntax. See class regbase for allowable flag values. 
 reg_expression(const - reg_expression&);Copy constructor - copies an - existing regular expression. 
 reg_expression& operator=(const - reg_expression&);Copies an existing regular - expression. 
 reg_expression& operator=(const - charT* ptr);Equivalent to assign(ptr); 
 template <class ST, class - SA>

reg_expression& operator=(const std::basic_string<charT, - ST, SA>& p);

-
Equivalent to assign(p); 
 reg_expression& assign(const - reg_expression& that);Copies the regular - expression contained by that, throws bad_expression if that - does not contain a valid expression. Returns *this. 
 reg_expression& assign(const - charT* p, flag_type f = regbase::normal);Compiles a regular - expression from the expression denoted by the null - terminated string p, using the flags f to - determine regular expression syntax. See class regbase for allowable flag values. - Throws bad_expression if p - does not contain a valid expression. Returns *this. 
 reg_expression& assign(const - charT* first, const charT* last, flag_type f = - regbase::normal);Compiles a regular - expression from the expression denoted by the pair of - input-iterators first-last, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. - Throws bad_expression if first-last - does not contain a valid expression. Returns *this. 
 template <class - string_traits, class A>
- reg_expression& assign(const std::basic_string<charT, - string_traits, A>& s, flag_type f = regbase::normal);
Compiles a regular - expression from the expression denoted by the string s, - using the flags f to determine regular expression - syntax. See class regbase for - allowable flag values. Throws bad_expression - if s does not contain a valid expression. Returns - *this. 
 template <class - iterator>
- reg_expression& assign(iterator first, iterator last, - flag_type f = regbase::normal);
Compiles a regular - expression from the expression denoted by the pair of - input-iterators first-last, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. - Throws bad_expression if first-last - does not contain a valid expression. Returns *this. 
 Allocator get_allocator()const;Returns the allocator used - by the expression. 
 locale_type imbue(const - locale_type& l);Imbues the expression with - the specified locale, and invalidates the current - expression. May throw std::runtime_error if the call - results in an attempt to open a non-existent message - catalogue. 
 locale_type getloc()const;Returns the locale used by - the expression. 
 flag_type getflags()const;Returns the flags used to - compile the current expression. 
 std::basic_string<charT> - str()const;Returns the current - expression as a string. 
 const_iterator begin()const;Returns a pointer to the - first character of the current expression. 
 const_iterator end()const;Returns a pointer to the end - of the current expression. 
 size_type size()const;Returns the length of the - current expression. 
 size_type max_size()const;Returns the maximum length - of a regular expression text. 
 bool empty()const;Returns true if the object - contains no valid expression. 
 unsigned mark_count()const - ;Returns the number of sub-expressions - in the compiled regular expression. Note that this - includes the whole match (subexpression zero), so the - value returned is always >= 1. 
- -
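The compile-and-assign cycle described in the table above can be sketched as follows. This is only an illustration under stated assumptions: it uses the C++11 standard library's std::regex as a stand-in for reg_expression, std::regex_error plays the role of bad_expression, and the helper name try_assign is hypothetical. Note also that std::regex::mark_count() does not count the whole match, unlike the mark_count() documented above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: compiles `pattern` into `re`, returning false
// instead of propagating the exception when the pattern is invalid.
// Assumption: std::regex stands in for reg_expression here.
bool try_assign(std::regex& re, const std::string& pattern)
{
   try
   {
      re.assign(pattern);   // replaces any previously held expression
      return true;
   }
   catch(const std::regex_error&)
   {
      return false;         // the standard's counterpart of bad_expression
   }
}
```

A caller can use the return value to decide whether the expression object is safe to pass to the matching algorithms.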
- -

Class regex_traits

- -

#include <boost/regex/regex_traits.hpp> -

- -

This is a preliminary version of the regular expression traits class, and is subject to change.

- -

The purpose of the traits class is to make it easier to customise the behaviour of reg_expression and the associated matching algorithms. Custom traits classes can handle special character sets or define additional character classes; for example, one could define [[:kanji:]] as the set of all (Unicode) kanji characters. This library provides three traits classes and a wrapper class regex_traits, which inherits from one of these depending upon the default localisation model in use: class c_regex_traits encapsulates the global C locale, class w32_regex_traits encapsulates the global Win32 locale (only available on Win32 systems), and class cpp_regex_traits encapsulates the C++ locale (only provided if std::locale is supported):

- -
template <class charT> class c_regex_traits;
-template<> class c_regex_traits<char> { /*details*/ };
-template<> class c_regex_traits<wchar_t> { /*details*/ };
-
-template <class charT> class w32_regex_traits;
-template<> class w32_regex_traits<char> { /*details*/ };
-template<> class w32_regex_traits<wchar_t> { /*details*/ };
-
-template <class charT> class cpp_regex_traits;
-template<> class cpp_regex_traits<char> { /*details*/ };
-template<> class cpp_regex_traits<wchar_t> { /*details*/ };
-
-template <class charT> class regex_traits : public base_type { /*details*/ };
- -

Where "base_type" defaults to w32_regex_traits on Win32 systems, and c_regex_traits otherwise. The default behaviour can be changed by defining one of BOOST_REGEX_USE_C_LOCALE (forces use of c_regex_traits by default), or BOOST_REGEX_USE_CPP_LOCALE (forces use of cpp_regex_traits by default). Alternatively a specific traits class can be passed to the reg_expression template.

- -

The requirements for custom traits classes are documented separately here....

- -

There is also an example of a custom traits class supplied by Christian Engström: see iso8859_1_regex_traits.cpp and iso8859_1_regex_traits.hpp, and the readme file for more details.

- -
- -

Class match_results

- -

#include <boost/regex.hpp> -

- -

Regular expressions are different from many simple pattern-matching algorithms in that as well as finding an overall match they can also produce sub-expression matches: each sub-expression being delimited in the pattern by a pair of parentheses (...). There has to be some method for reporting sub-expression matches back to the user: this is achieved by defining a class match_results that acts as an indexed collection of sub-expression matches, each sub-expression match being contained in an object of type sub_match.

- -
// 
-// class sub_match: 
-// denotes one sub-expression match. 
-//         
-template <class iterator>
-struct sub_match
-{
-   typedef typename std::iterator_traits<iterator>::value_type       value_type;
-   typedef typename std::iterator_traits<iterator>::difference_type  difference_type;
-   typedef iterator                                                  iterator_type;
-   
-   iterator first;
-   iterator second;
-   bool matched;
-
-   operator std::basic_string<value_type>()const;
-
-   bool operator==(const sub_match& that)const;
-   bool operator !=(const sub_match& that)const;
-   difference_type length()const;
-};
-
-// 
-// class match_results: 
-// contains an indexed collection of matched sub-expressions. 
-// 
-template <class iterator, class Allocator = std::allocator<typename std::iterator_traits<iterator>::value_type > > 
-class match_results 
-{ 
-public: 
-   typedef Allocator                                                 alloc_type; 
-   typedef typename Allocator::template Rebind<iterator>::size_type  size_type; 
-   typedef typename std::iterator_traits<iterator>::value_type       char_type; 
-   typedef sub_match<iterator>                                       value_type; 
-   typedef typename std::iterator_traits<iterator>::difference_type  difference_type; 
-   typedef iterator                                                  iterator_type; 
-   explicit match_results(const Allocator& a = Allocator()); 
-   match_results(const match_results& m); 
-   match_results& operator=(const match_results& m); 
-   ~match_results(); 
-   size_type size()const; 
-   const sub_match<iterator>& operator[](int n) const; 
-   Allocator allocator()const; 
-   difference_type length(int sub = 0)const; 
-   difference_type position(unsigned int sub = 0)const; 
-   unsigned int line()const; 
-   iterator line_start()const; 
-   std::basic_string<char_type> str(int sub = 0)const; 
-   void swap(match_results& that); 
-   bool operator==(const match_results& that)const; 
-   bool operator<(const match_results& that)const; 
-};
-typedef match_results<const char*> cmatch;
-typedef match_results<const wchar_t*> wcmatch; 
-typedef match_results<std::string::const_iterator> smatch;
-typedef match_results<std::wstring::const_iterator> wsmatch; 
- -

Class match_results is used for reporting what matched a regular expression. It is passed to the matching algorithms regex_match and regex_search, and is used by regex_grep to notify the callback function (or function object) what matched. Note that the default allocator parameter has been chosen to match the default allocator parameter to reg_expression. match_results has the following public member functions:

 match_results(Allocator a = - Allocator());Constructs an instance of - match_results, using allocator instance a. 
 match_results(const - match_results& m);Copy constructor. 
 match_results& operator=(const - match_results& m);Assignment operator. 
 const sub_match<iterator>& operator[](int n) const; Returns what matched: item 0 represents the whole match, item 1 the first sub-expression and so on. 
 Allocator& allocator()const;Returns the allocator used - by the class. 
 difference_type length(unsigned - int sub = 0);Returns the length of the - matched subexpression, defaults to the length of the - whole match, in effect this is equivalent to operator[](sub).second - - operator[](sub).first. 
 difference_type position(unsigned - int sub = 0);Returns the position of the - matched sub-expression, defaults to the position of the - whole match. The returned value is the position of the - match relative to the start of the string. 
 unsigned int - line()const;Returns the index of the - line on which the match occurred, indices start with 1, - not zero. Equivalent to the number of newline characters - prior to operator[](0).first plus one. 
 iterator line_start()const;Returns an iterator denoting - the start of the line on which the match occurred. 
 size_type size()const; Returns how many sub-expressions are present in the match, including sub-expression zero (the whole match). This is the case even if no matches were found in the search operation; you must use the returned value from regex_search / regex_match to determine whether any match occurred. 
- -


- -

The operator[] member function needs further explanation: it returns a const reference to a structure of type sub_match<iterator>, which has the following public members:

 typedef typename - std::iterator_traits<iterator>::value_type - value_type;The type pointed to by the - iterators. 
 typedef typename - std::iterator_traits<iterator>::difference_type - difference_type;A type that represents the - difference between two iterators. 
 typedef iterator - iterator_type;The iterator type. 
 iterator firstAn iterator denoting the - position of the start of the match. 
 iterator secondAn iterator denoting the - position of the end of the match. 
 bool matchedA Boolean value denoting - whether this sub-expression participated in the match. 
 difference_type length()const;Returns the length of the - sub-expression match. 
 operator std::basic_string<value_type> - ()const;Converts the sub-expression - match into an instance of std::basic_string<>. Note - that this member may be either absent, or present to a - more limited degree depending upon your compiler - capabilities. 
- -
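The members listed above can be exercised with a short sketch. It assumes std::smatch and std::ssub_match from the C++11 standard library as stand-ins for match_results and sub_match (they expose the same first, second and matched members); the helper name sub_length is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: length of sub-expression n in the first match of e
// within text, or -1 if there is no match or n took no part in it.
// Assumption: std::regex / std::smatch stand in for the regex++ types.
long sub_length(const std::string& text, const std::regex& e, std::size_t n)
{
   std::smatch what;
   if(!std::regex_search(text, what, e) || !what[n].matched)
      return -1;
   // the length is the distance between the two iterators:
   return static_cast<long>(what[n].second - what[n].first);
}
```

Checking matched before using first and second matters because a sub-expression inside an optional group may take no part in an otherwise successful match.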

Operator[] takes an integer argument that denotes the sub-expression for which to return information. The argument can take the following special values:

 -2 Returns everything from the end of the match to the end of the input string, equivalent to $' in Perl. If this is a null string, then first == second and matched == false. 
 -1 Returns everything from the start of the input string (or the end of the last match if this is a grep operation) to the start of this match. Equivalent to $` in Perl. If this is a null string, then first == second and matched == false. 
 0 Returns the whole of what matched, equivalent to $& in Perl. The matched member is always true. 
 0 < N < size() Returns what matched sub-expression N. If this sub-expression did not participate in the match then matched == false, otherwise matched == true. 
 N < -2 or N >= size() Represents an out-of-range, non-existent sub-expression. Returns a "null" match in which first == second and matched == false. 
- -
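The special index values can be illustrated with a small sketch. It is written against the C++11 standard library, which is an assumption rather than the interface documented here: std::smatch has no negative indices, so prefix() stands in for index -1 and suffix() for index -2. The struct parts and the helper split_around_match are hypothetical names.

```cpp
#include <regex>
#include <string>

struct parts
{
   std::string before;          // counterpart of what[-1], Perl's $`
   std::string whole;           // counterpart of what[0],  Perl's $&
   std::string after;           // counterpart of what[-2], Perl's $'
   bool second_group_matched;   // what[2].matched
};

// Hypothetical helper: splits `text` around the first match of (a+)|(b+).
// The strings are copied out before returning, because match results only
// hold iterators into `text`.
parts split_around_match(const std::string& text)
{
   static const std::regex e("(a+)|(b+)");
   std::smatch what;
   parts p;
   p.second_group_matched = false;
   if(std::regex_search(text, what, e))
   {
      p.before = what.prefix().str();
      p.whole = what[0].str();
      p.after = what.suffix().str();
      p.second_group_matched = what[2].matched;
   }
   return p;
}
```

Only one branch of the alternation can take part in a given match, so exactly one of the two sub-expressions reports matched == true.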

Note that as well as being parameterised for an allocator, match_results<> also takes an iterator type. This allows any pair of iterators to be searched for a given regular expression, provided the iterators have at least bi-directional properties.

- -
- -

Algorithm regex_match

- -

#include <boost/regex.hpp> -

- -

The algorithm regex_match determines whether a given regular expression matches a given sequence denoted by a pair of bidirectional-iterators. Note that the result is true only if the expression matches the whole of the input sequence; the main use of this function is data input validation. The algorithm is defined as follows:

- -
template <class iterator, class Allocator, class charT, class traits, class Allocator2>
-bool regex_match(iterator first, 
-                 iterator last, 
-                 match_results<iterator, Allocator>& m, 
-                 const reg_expression<charT, traits, Allocator2>& e, 
-                 unsigned flags = match_default);
- -

The library also defines the following convenience versions, which take either a const charT*, or a const std::basic_string<>& in place of a pair of iterators [note - these versions may not be available, or may be available in a more limited form, depending upon your compiler's capabilities]:

- -
template <class charT, class Allocator, class traits, class Allocator2>
-bool regex_match(const charT* str, 
-                 match_results<const charT*, Allocator>& m, 
-                 const reg_expression<charT, traits, Allocator2>& e, 
-                 unsigned flags = match_default);
-
-template <class ST, class SA, class Allocator, class charT, class traits, class Allocator2>
-bool regex_match(const std::basic_string<charT, ST, SA>& s, 
-                 match_results<typename std::basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
-                 const reg_expression<charT, traits, Allocator2>& e, 
-                 unsigned flags = match_default);
- -

Finally there is a set of convenience versions that simply return true or false and do not indicate what matched:

- -
template <class iterator, class charT, class traits, class Allocator2>
-bool regex_match(iterator first, 
-                 iterator last, 
-                 const reg_expression<charT, traits, Allocator2>& e, 
-                 unsigned flags = match_default);
-
-template <class charT, class traits, class Allocator2>
-bool regex_match(const charT* str, 
-                 const reg_expression<charT, traits, Allocator2>& e, 
-                 unsigned flags = match_default);
-
-template <class ST, class SA, class charT, class traits, class Allocator2>
-bool regex_match(const std::basic_string<charT, ST, SA>& s, 
-                 const reg_expression<charT, traits, Allocator2>& e, 
-                 unsigned flags = match_default);
- -
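As a sketch of the data input validation use mentioned above, the following contrasts a whole-input match with a search. It assumes the C++11 standard library's std::regex_match and std::regex_search as stand-ins for the functions documented here; the function names is_integer and contains_integer are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical validator: accepts only strings that consist entirely of
// an optional sign followed by one or more digits.
bool is_integer(const std::string& input)
{
   static const std::regex e("[-+]?[0-9]+");
   // regex_match succeeds only if the expression matches the whole input.
   return std::regex_match(input, e);
}

// For contrast: regex_search succeeds if any part of the input matches.
bool contains_integer(const std::string& input)
{
   static const std::regex e("[-+]?[0-9]+");
   return std::regex_search(input, e);
}
```

With these definitions "42abc" is rejected by is_integer but accepted by contains_integer, which is why the matching form is the one to use for validation.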

The parameters for the main function version are as follows:

 iterator firstDenotes the start of the range to be matched. 
 iterator lastDenotes the end of the range - to be matched. 
 match_results<iterator, - Allocator>& mAn instance of match_results - in which what matched will be reported. On exit if a - match occurred then m[0] denotes the whole of the string - that matched, m[0].first must be equal to first, m[0].second - will be less than or equal to last. m[1] denotes the - first subexpression m[2] the second subexpression and so - on. If no match occurred then m[0].first = m[0].second = - last.

Note that since the match_results structure - stores only iterators, and not strings, the iterators/strings - passed to regex_match must be valid for as long as the - result is to be used. For that reason never pass - temporary string objects to regex_match.

-
 
 const - reg_expression<charT, traits, Allocator2>& eContains the regular - expression to be matched. 
 unsigned flags = - match_defaultDetermines the semantics - used for matching, a combination of one or more match_flags enumerators. 
- -

regex_match returns false if no match occurs or true if it does. A match only occurs if it starts at first and finishes at last. Example: the following example processes an ftp response:

- -
#include <cstdlib> 
-#include <boost/regex.hpp> 
-#include <string> 
-#include <iostream> 
-
-using namespace boost; 
-
-regex expression("([0-9]+)(\\-| |$)(.*)"); 
-
-// process_ftp: 
-// on success returns the ftp response code, and fills 
-// msg with the ftp response message. 
-int process_ftp(const char* response, std::string* msg) 
-{ 
-   cmatch what; 
-   if(regex_match(response, what, expression)) 
-   { 
-      // what[0] contains the whole string 
-      // what[1] contains the response code 
-      // what[2] contains the separator character 
-      // what[3] contains the text message. 
-      if(msg) 
-         msg->assign(what[3].first, what[3].second); 
-      return std::atoi(what[1].first); 
-   } 
-   // failure: the response did not match 
-   if(msg) 
-      msg->erase(); 
-   return -1; 
-}
- -

The value of the flags parameter passed to the algorithm must be a combination of one or more of the following values:

 match_defaultThe default value, indicates - that first represents the start of a line, the - start of a buffer, and (possibly) the start of a word. - Also implies that last represents the end of a - line, the end of the buffer and (possibly) the end of a - word. Implies that a dot sub-expression "." - will match both the newline character and a null. 
 match_not_bolWhen this flag is set then first - does not represent the start of a new line. 
 match_not_eolWhen this flag is set then last - does not represent the end of a line. 
 match_not_bobWhen this flag is set then first - is not the beginning of a buffer. 
 match_not_eobWhen this flag is set then last - does not represent the end of a buffer. 
 match_not_bowWhen this flag is set then first - can never match the start of a word. 
 match_not_eowWhen this flag is set then last - can never match the end of a word. 
 match_not_dot_newlineWhen this flag is set then a - dot expression "." can not match the newline - character. 
 match_not_dot_nullWhen this flag is set then a - dot expression "." can not match a null - character. 
 match_prev_availWhen this flag - is set, then *--first is a valid expression and - the flags match_not_bol and match_not_bow have no effect, - since the value of the previous character can be used to - check these. 
 match_anyWhen this flag - is set, then the first string matched is returned, rather - than the longest possible match. This flag can - significantly reduce the time taken to find a match, but - what matches is undefined. 
 match_not_nullWhen this flag - is set, then the expression will never match a null - string. 
 match_continuous When this flag is set, then during a grep operation, each successive match must start from where the previous match finished. 
 match_partialWhen this flag - is set, the regex algorithms will report partial matches - that is - where one or more characters at the end of the text input - matched some prefix of the regular expression. 
- -
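The effect of the line-boundary flags can be sketched as below. The example assumes the C++11 standard library, whose std::regex_constants::match_not_bol and match_not_eol correspond to the flags of the same name in the table above; the helper name found is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `pattern` is found in `text` when
// the ends of `text` are interpreted according to `flags`.
// Assumption: std::regex_search stands in for the regex++ algorithm.
bool found(const std::string& text, const char* pattern,
           std::regex_constants::match_flag_type flags)
{
   const std::regex e(pattern);
   return std::regex_search(text, e, flags);
}
```

For example, "^abc" is found in "abc" with match_default, but not with match_not_bol, because the start of the text then no longer counts as the start of a line.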

 

- -
- -

Algorithm regex_search

- -

 #include <boost/regex.hpp> -

- -

The algorithm regex_search will search a range denoted by a pair of bidirectional-iterators for a given regular expression. The algorithm uses various heuristics to reduce the search time by only checking for a match if a match could conceivably start at that position. The algorithm is defined as follows:

- -
template <class iterator, class Allocator, class charT, class traits, class Allocator2>
-bool regex_search(iterator first, 
-                iterator last, 
-                match_results<iterator, Allocator>& m, 
-                const reg_expression<charT, traits, Allocator2>& e, 
-                unsigned flags = match_default);
- -

The library also defines the following convenience versions, which take either a const charT*, or a const std::basic_string<>& in place of a pair of iterators [note - these versions may not be available, or may be available in a more limited form, depending upon your compiler's capabilities]:

- -
template <class charT, class Allocator, class traits, class Allocator2>
-bool regex_search(const charT* str, 
-                match_results<const charT*, Allocator>& m, 
-                const reg_expression<charT, traits, Allocator2>& e, 
-                unsigned flags = match_default);
-
-template <class ST, class SA, class Allocator, class charT, class traits, class Allocator2>
-bool regex_search(const std::basic_string<charT, ST, SA>& s, 
-                match_results<typename std::basic_string<charT, ST, SA>::const_iterator, Allocator>& m, 
-                const reg_expression<charT, traits, Allocator2>& e, 
-                unsigned flags = match_default);
- -

The parameters for the main function version are as follows:

 iterator firstThe starting position of the - range to search. 
 iterator lastThe ending position of the - range to search. 
 match_results<iterator, - Allocator>& mAn instance of match_results - in which what matched will be reported. On exit if a - match occurred then m[0] denotes the whole of the string - that matched, m[0].first and m[0].second will be less - than or equal to last. m[1] denotes the first sub-expression - m[2] the second sub-expression and so on. If no match - occurred then m[0].first = m[0].second = last.

Note - that since the match_results structure stores only - iterators, and not strings, the iterators/strings passed - to regex_search must be valid for as long as the result - is to be used. For that reason never pass temporary - string objects to regex_search.

-
 
 const - reg_expression<charT, traits, Allocator2>& eThe regular expression to - search for. 
 unsigned flags = - match_defaultThe flags that determine - what gets matched, a combination of one or more match_flags enumerators. 
- -
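A compact sketch of the usual search loop follows: search, record m[0], then resume from m[0].second. It assumes the C++11 standard library's std::regex_search as a stand-in (the standard has match_prev_avail but no match_not_bob), and the helper name find_numbers is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of digits in `text`, in order.
std::vector<std::string> find_numbers(const std::string& text)
{
   static const std::regex e("[0-9]+");
   std::vector<std::string> result;
   std::string::const_iterator start = text.begin();
   std::string::const_iterator end = text.end();
   std::smatch what;
   std::regex_constants::match_flag_type flags =
      std::regex_constants::match_default;
   while(std::regex_search(start, end, what, e, flags))
   {
      result.push_back(what[0].str());   // copy the text of the whole match
      start = what[0].second;            // resume after this match
      // `start` no longer denotes the beginning of the buffer:
      flags = flags | std::regex_constants::match_prev_avail;
   }
   return result;
}
```

The matched text is copied into the vector straight away, so the results remain usable after the iterators held by the match object have gone out of scope.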


- -

Example: the following example takes the contents of a file in the form of a string, and searches for all the C++ class declarations in the file. The code will work regardless of the way that std::string is implemented; for example, it could easily be modified to work with the SGI rope class, which uses a non-contiguous storage strategy.

- -
#include <string> 
-#include <map> 
-#include <boost/regex.hpp> 
-
-// purpose: 
-// takes the contents of a file in the form of a string 
-// and searches for all the C++ class definitions, storing 
-// their locations in a map of strings/int's 
-typedef std::map<std::string, int, std::less<std::string> > map_type; 
-
-boost::regex expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\\{|:[^;\\{()]*\\{)"); 
-
-void IndexClasses(map_type& m, const std::string& file) 
-{ 
-   std::string::const_iterator start, end; 
-   start = file.begin(); 
-   end = file.end(); 
-   boost::match_results<std::string::const_iterator> what; 
-   unsigned int flags = boost::match_default; 
-   while(boost::regex_search(start, end, what, expression, flags)) 
-   { 
-      // what[0] contains the whole string 
-      // what[5] contains the class name. 
-      // what[6] contains the template specialisation if any. 
-      // add class name and position to map: 
-      m[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = 
-                what[5].first - file.begin(); 
-      // update search position: 
-      start = what[0].second; 
-      // update flags: 
-      flags |= boost::match_prev_avail; 
-      flags |= boost::match_not_bob; 
-   } 
-}
- 
- -
- -

Algorithm regex_grep

- -

#include <boost/regex.hpp> -

- -

Regex_grep allows you to search through a bidirectional-iterator range and locate all the (non-overlapping) matches with a given regular expression. The function is declared as:

- -
template <class Predicate, class iterator, class charT, class traits, class Allocator>
-unsigned int regex_grep(Predicate foo, 
-                        iterator first, 
-                        iterator last, 
-                        const reg_expression<charT, traits, Allocator>& e, 
-                        unsigned flags = match_default);
- -

The library also defines the following convenience versions, which take either a const charT*, or a const std::basic_string<>& in place of a pair of iterators [note - these versions may not be available, or may be available in a more limited form, depending upon your compiler's capabilities]:

- -
template <class Predicate, class charT, class Allocator, class traits>
-unsigned int regex_grep(Predicate foo, 
-              const charT* str, 
-              const reg_expression<charT, traits, Allocator>& e, 
-              unsigned flags = match_default);
-
-template <class Predicate, class ST, class SA, class Allocator, class charT, class traits>
-unsigned int regex_grep(Predicate foo, 
-              const std::basic_string<charT, ST, SA>& s, 
-              const reg_expression<charT, traits, Allocator>& e, 
-              unsigned flags = match_default);
- -

The parameters for the primary version of regex_grep have the following meanings:

 fooA predicate function object - or function pointer, see below for more information. 
 firstThe start of the range to - search. 
 lastThe end of the range to - search. 
 eThe regular expression to - search for. 
 flags The flags that determine how matching is carried out, a combination of one or more match_flags enumerators. 
- -

The algorithm finds all of the non-overlapping matches of the expression e. For each match it fills a match_results<iterator, Allocator> structure, which contains information on what matched, and calls the predicate foo, passing the match_results<iterator, Allocator> as a single argument. If the predicate returns true, then the grep operation continues, otherwise it terminates without searching for further matches. The function returns the number of matches found.

- -

The general form of the predicate is:

- -
struct grep_predicate
-{
-   bool operator()(const match_results<iterator_type, expression_type::alloc_type>& m);
-};
- -

For example the regular expression "a*b" would find one match in the string "aaaaab" and two in the string "aaabb".

- -
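The match counts quoted above can be checked with a sketch of the same predicate-driven loop. The C++11 standard library has no regex_grep, so this assumes std::sregex_iterator as a stand-in; grep_like and accept_all are hypothetical names, not part of this library.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for regex_grep: calls `pred` for each
// non-overlapping match and stops early if `pred` returns false.
// Returns the number of matches found.
template <class Predicate>
unsigned int grep_like(Predicate pred, const std::string& text, const std::regex& e)
{
   unsigned int count = 0;
   std::sregex_iterator it(text.begin(), text.end(), e), last;
   for(; it != last; ++it)
   {
      ++count;
      if(!pred(*it))   // *it is a std::smatch describing this match
         break;
   }
   return count;
}

// A predicate that accepts every match, so the search runs to the end.
struct accept_all
{
   bool operator()(const std::smatch&) const { return true; }
};
```

Applied to "a*b", this sketch counts one match in "aaaaab" ("aaaaab" itself) and two in "aaabb" ("aaab" followed by "b").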

Remember this algorithm can be used for a lot more than implementing a version of grep: the predicate can be and do anything that you want. A grep utility would output the results to the screen, another program could index a file based on a regular expression and store a set of bookmarks in a list, and a text file conversion utility would output to file. The results of one regex_grep can even be chained into another regex_grep to create recursive parsers.

- -

Example: convert the example from regex_search to use regex_grep instead:

- -
#include <string> 
-#include <map> 
-#include <boost/regex.hpp> 
-
-// IndexClasses: 
-// takes the contents of a file in the form of a string 
-// and searches for all the C++ class definitions, storing 
-// their locations in a map of strings/int's 
-
-typedef std::map<std::string, int, std::less<std::string> > map_type; 
-
-boost::regex expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?" 
-                 "(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?[[:space:]]*)*(\\<\\w*\\>)" 
-                 "[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\\{|:[^;\\{()]*\\{)"); 
-
-class IndexClassesPred 
-{ 
-   map_type& m; 
-   std::string::const_iterator base; 
-public: 
-   IndexClassesPred(map_type& a, std::string::const_iterator b) : m(a), base(b) {} 
-   bool operator()(const boost::match_results<std::string::const_iterator, boost::regex::alloc_type>& what) 
-   { 
-      // what[0] contains the whole string 
-      // what[5] contains the class name. 
-      // what[6] contains the template specialisation if any. 
-      // add class name and position to map: 
-      m[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = 
-                what[5].first - base; 
-      return true; 
-   } 
-}; 
-
-void IndexClasses(map_type& m, const std::string& file) 
-{ 
-   std::string::const_iterator start, end; 
-   start = file.begin(); 
-   end = file.end(); 
-   boost::regex_grep(IndexClassesPred(m, start), start, end, expression); 
-} 
- -

Example: use regex_grep to call a global callback function:

- -
#include <string> 
-#include <map> 
-#include <boost/regex.hpp> 
-
-// purpose: 
-// takes the contents of a file in the form of a string 
-// and searches for all the C++ class definitions, storing 
-// their locations in a map of strings/int's 
-
-typedef std::map<std::string, int, std::less<std::string> > map_type; 
-
-boost::regex expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\\{|:[^;\\{()]*\\{)"); 
-
-map_type class_index; 
-std::string::const_iterator base; 
-
-bool grep_callback(const boost::match_results<std::string::const_iterator, boost::regex::alloc_type>& what) 
-{ 
-   // what[0] contains the whole string 
-   // what[5] contains the class name. 
-   // what[6] contains the template specialisation if any. 
-   // add class name and position to map: 
-   class_index[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = 
-                what[5].first - base; 
-   return true; 
-} 
-
-void IndexClasses(const std::string& file) 
-{ 
-   std::string::const_iterator start, end; 
-   start = file.begin(); 
-   end = file.end(); 
-   base = start; 
-   boost::regex_grep(grep_callback, start, end, expression, boost::match_default); 
-}
-  
- -

Example: use regex_grep to call a class member function, using the standard library adapters std::mem_fun and std::bind1st to convert the member function into a predicate:

- -
#include <string> 
-#include <map> 
-#include <boost/regex.hpp> 
-#include <functional> 
-
-// purpose: 
-// takes the contents of a file in the form of a string 
-// and searches for all the C++ class definitions, storing 
-// their locations in a map of strings/int's 
-
-typedef std::map<std::string, int, std::less<std::string> > map_type; 
-
-class class_index 
-{ 
-   boost::regex expression; 
-   map_type index; 
-   std::string::const_iterator base; 
-   bool grep_callback(boost::match_results<std::string::const_iterator, boost::regex::alloc_type> what); 
-public: 
-   void IndexClasses(const std::string& file); 
-   class_index() 
-      : index(), 
-        expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?" 
-                   "(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?" 
-                   "[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?" 
-                   "(\\{|:[^;\\{()]*\\{)" 
-                   ){} 
-}; 
-
-bool class_index::grep_callback(boost::match_results<std::string::const_iterator, boost::regex::alloc_type> what) 
-{ 
-   // what[0] contains the whole string 
-   // what[5] contains the class name. 
-   // what[6] contains the template specialisation if any. 
-   // add class name and position to map: 
-   index[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = 
-               what[5].first - base; 
-   return true; 
-} 
-
-void class_index::IndexClasses(const std::string& file) 
-{ 
-   std::string::const_iterator start, end; 
-   start = file.begin(); 
-   end = file.end(); 
-   base = start; 
-   regex_grep(std::bind1st(std::mem_fun(&class_index::grep_callback), this), 
-              start, 
-              end, 
-              expression); 
-} 
-  
- -
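The adapter-based approach above predates lambdas: std::mem_fun and std::bind1st were deprecated in C++11 and removed in C++17. The following is a minimal sketch of the same idea with a lambda. It assumes std::regex from the standard library as a stand-in for regex++, uses a much simpler expression than the one above, and the names for_each_match and name_index are hypothetical, not part of either library.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: calls pred for every match in [first, last) and stops
// early when pred returns false. Returns the number of matches visited.
template <class Predicate>
std::size_t for_each_match(Predicate pred,
                           std::string::const_iterator first,
                           std::string::const_iterator last,
                           const std::regex& e)
{
   std::size_t count = 0;
   for (std::sregex_iterator it(first, last, e), end; it != end; ++it)
   {
      ++count;
      if (!pred(*it))
         break;
   }
   return count;
}

// Hypothetical class that records the offset of each "class X" / "struct X".
class name_index
{
   std::regex expression;
   std::map<std::string, int> index;
   std::string::const_iterator base;

   bool on_match(const std::smatch& what)
   {
      assert(what.size() > 2);  // whole match, keyword, name
      // what[2] holds the name; store its offset from the start of the text.
      index[what[2].str()] = static_cast<int>(what[2].first - base);
      return true;
   }

public:
   name_index() : expression("\\b(class|struct)\\s+(\\w+)") {}

   void index_names(const std::string& file)
   {
      base = file.begin();
      // The lambda captures this, replacing bind1st(mem_fun(...), this).
      for_each_match([this](const std::smatch& what) { return on_match(what); },
                     file.begin(), file.end(), expression);
   }

   const std::map<std::string, int>& result() const { return index; }
};
```

Because the lambda takes its argument by const reference, the callback no longer needs to accept match results by value as the adapter version does.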

Finally, C++ Builder users can use C++ Builder's closure type as a callback argument:

- -
#include <string> 
-#include <map> 
-#include <boost/regex.hpp> 
-#include <functional> 
-
-// purpose: 
-// takes the contents of a file in the form of a string 
-// and searches for all the C++ class definitions, storing 
-// their locations in a map of strings/int's 
-
-typedef std::map<std::string, int, std::less<std::string> > map_type; 
-class class_index 
-{ 
-   boost::regex expression; 
-   map_type index; 
-   std::string::const_iterator base; 
-   typedef boost::match_results<std::string::const_iterator, boost::regex::alloc_type> arg_type; 
-   bool grep_callback(const arg_type& what); 
-public: 
-   typedef bool (__closure* grep_callback_type)(const arg_type&); 
-   void IndexClasses(const std::string& file); 
-   class_index() 
-      : index(), 
-        expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?" 
-                   "(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?" 
-                   "[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?" 
-                   "(\\{|:[^;\\{()]*\\{)" 
-                   ){} 
-}; 
-
-bool class_index::grep_callback(const arg_type& what) 
-{ 
-   // what[0] contains the whole string 
-   // what[5] contains the class name. 
-   // what[6] contains the template specialisation if any. 
-   // add class name and position to map: 
-   index[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = 
-               what[5].first - base; 
-   return true; 
-} 
-
-void class_index::IndexClasses(const std::string& file) 
-{ 
-   std::string::const_iterator start, end; 
-   start = file.begin(); 
-   end = file.end(); 
-   base = start; 
-   class_index::grep_callback_type cl = &(this->grep_callback); 
-   regex_grep(cl, 
-            start, 
-            end, 
-            expression); 
-} 
- -
- -

 Algorithm regex_format

- -

#include <boost/regex.hpp>

- -

The algorithm regex_format takes the results of a match and creates a new string based upon a format string; regex_format can be used for search and replace operations:

- -
template <class OutputIterator, class iterator, class Allocator, class charT>
-OutputIterator regex_format(OutputIterator out,
-                            const match_results<iterator, Allocator>& m,
-                            const charT* fmt,
-                            unsigned flags = 0);
-
-template <class OutputIterator, class iterator, class Allocator, class charT>
-OutputIterator regex_format(OutputIterator out,
-                            const match_results<iterator, Allocator>& m,
-                            const std::basic_string<charT>& fmt,
-                            unsigned flags = 0);
- -

The library also defines the following convenience variation of regex_format, which returns the result directly as a string, rather than outputting to an iterator [note - this version may not be available, or may be available in a more limited form, depending upon your compiler's capabilities]:

- -
template <class iterator, class Allocator, class charT>
-std::basic_string<charT> regex_format
-                                 (const match_results<iterator, Allocator>& m, 
-                                  const charT* fmt,
-                                  unsigned flags = 0);
-
-template <class iterator, class Allocator, class charT>
-std::basic_string<charT> regex_format
-                                 (const match_results<iterator, Allocator>& m, 
-                                  const std::basic_string<charT>& fmt,
-                                  unsigned flags = 0);
- -

Parameters to the main version of the function are passed as follows:

- - - - - - - - - - - - - - - - - - - - - - - - - - -
OutputIterator out: An output iterator type; the output string is sent to this iterator. Typically this would be a std::ostream_iterator.
const match_results<iterator, Allocator>& m: An instance of match_results<> obtained from one of the matching algorithms above, and denoting what matched.
const charT* fmt: A format string that determines how the match is transformed into the new string.
unsigned flags: Optional flags which describe how the format string is to be interpreted.
- -

Format flags are defined as follows:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
format_all: Enables all syntax options (perl-like plus extensions).
format_sed: Allows only a sed-like syntax.
format_perl: Allows only a perl-like syntax.
format_no_copy: Disables copying of unmatched sections to the output string during regex_merge operations.
format_first_only: When this flag is set only the first occurrence will be replaced (applies to regex_merge only).
- -


- -

The format string syntax (and available options) is described more fully under format strings.
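To make the role of the format string concrete, here is a minimal sketch that assumes std::regex from the standard library rather than regex++, so that it is self-contained. std::match_results::format plays roughly the role described above for regex_format: it expands $N placeholders from a match into a new string. The function name reorder_date is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: rewrites the first yyyy-mm-dd date found in text as
// dd/mm/yyyy, or returns an empty string when no date is present.
std::string reorder_date(const std::string& text)
{
   static const std::regex date("(\\d{4})-(\\d{2})-(\\d{2})");
   std::smatch what;
   if (!std::regex_search(text, what, date))
      return std::string();
   assert(what.size() == 4);  // whole match plus three sub-expressions
   // $1..$3 refer to the marked sub-expressions of the match.
   return what.format("$3/$2/$1");
}
```

Only the matched text is formatted here; the surrounding text is not copied, which is the difference between formatting a single match and a full search and replace.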

- -
- -

Algorithm regex_merge

- -

#include <boost/regex.hpp>

- -

The algorithm regex_merge is a combination of regex_grep and regex_format. That is, it greps through the string finding all the matches to the regular expression; for each match it then calls regex_format to format the string and sends the result to the output iterator. Sections of text that do not match are copied to the output unchanged only if the flags parameter does not have the flag format_no_copy set. If the flag format_first_only is set then only the first occurrence is replaced rather than all occurrences.

- -
template <class OutputIterator, class iterator, class traits, class Allocator, class charT>
-OutputIterator regex_merge(OutputIterator out, 
-                          iterator first,
-                          iterator last,
-                          const reg_expression<charT, traits, Allocator>& e, 
-                          const charT* fmt, 
-                          unsigned int flags = match_default);
-
-template <class OutputIterator, class iterator, class traits, class Allocator, class charT>
-OutputIterator regex_merge(OutputIterator out, 
-                           iterator first,
-                           iterator last,
-                           const reg_expression<charT, traits, Allocator>& e, 
-                           std::basic_string<charT>& fmt, 
-                           unsigned int flags = match_default);
- -

The library also defines the following convenience variation of regex_merge, which returns the result directly as a string, rather than outputting to an iterator [note - this version may not be available, or may be available in a more limited form, depending upon your compiler's capabilities]:

- -
template <class traits, class Allocator, class charT>
-std::basic_string<charT> regex_merge(const std::basic_string<charT>& text,
-                                     const reg_expression<charT, traits, Allocator>& e, 
-                                     const charT* fmt, 
-                                     unsigned int flags = match_default);
-
-template <class traits, class Allocator, class charT>
-std::basic_string<charT> regex_merge(const std::basic_string<charT>& text,
-                                     const reg_expression<charT, traits, Allocator>& e, 
-                                     const std::basic_string<charT>& fmt, 
-                                     unsigned int flags = match_default);
- -

Parameters to the main version of the function are passed as follows:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
OutputIterator out: An output iterator type; the output string is sent to this iterator. Typically this would be a std::ostream_iterator.
iterator first: The start of the range of text to grep (bidirectional-iterator).
iterator last: The end of the range of text to grep (bidirectional-iterator).
const reg_expression<charT, traits, Allocator>& e: The expression to search for.
const charT* fmt: The format string to be applied to sections of text that match.
unsigned int flags = match_default: Flags which determine how the expression is matched (see match_flags) and how the format string is interpreted (see format_flags).
- -
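The effect of the two format flags is easiest to see side by side. The sketch below assumes std::regex_replace from the standard library, which plays a comparable role to regex_merge and offers flags with the same names; it is an illustration of the behaviour described above, not the regex++ implementation. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Replace every run of digits with "#"; unmatched text is copied unchanged.
std::string mask_all(const std::string& text)
{
   static const std::regex digits("\\d+");
   return std::regex_replace(text, digits, "#");
}

// Replace only the first run of digits; the rest of the text is copied as is.
std::string mask_first(const std::string& text)
{
   static const std::regex digits("\\d+");
   return std::regex_replace(text, digits, "#",
                             std::regex_constants::format_first_only);
}

// Emit only the formatted matches; unmatched text is not copied.
std::string digits_only(const std::string& text)
{
   static const std::regex digits("\\d+");
   return std::regex_replace(text, digits, "[$&]",
                             std::regex_constants::format_no_copy);
}
```

For the input "a1b22c333" these produce "a#b#c#", "a#b22c333" and "[1][22][333]" respectively.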

Example: the following example takes C/C++ source code as input, and outputs syntax highlighted HTML code.

- -
-#include <fstream>
-#include <sstream>
-#include <string>
-#include <iterator>
-#include <boost/regex.hpp>
-#include <fstream>
-#include <iostream>
-
-// purpose:
-// takes the contents of a file and transforms it into
-// syntax highlighted code in html format
-
-boost::regex e1, e2;
-extern const char* expression_text;
-extern const char* format_string;
-extern const char* pre_expression;
-extern const char* pre_format;
-extern const char* header_text;
-extern const char* footer_text;
-
-void load_file(std::string& s, std::istream& is)
-{
-   s.erase();
-   s.reserve(is.rdbuf()->in_avail());
-   char c;
-   while(is.get(c))
-   {
-      if(s.capacity() == s.size())
-         s.reserve(s.capacity() * 3);
-      s.append(1, c);
-   }
-}
-
-int main(int argc, const char** argv)
-{
-   try{
-   e1.assign(expression_text);
-   e2.assign(pre_expression);
-   for(int i = 1; i < argc; ++i)
-   {
-      std::cout << "Processing file " << argv[i] << std::endl;
-      std::ifstream fs(argv[i]);
-      std::string in;
-      load_file(in, fs);
-      std::string out_name(std::string(argv[i]) + std::string(".htm"));
-      std::ofstream os(out_name.c_str());
-      os << header_text;
-      // strip '<' and '>' first by outputting to a
-      // temporary string stream
-      std::ostringstream t(std::ios::out | std::ios::binary);
-      std::ostream_iterator<char, char> oi(t);
-      boost::regex_merge(oi, in.begin(), in.end(), e2, pre_format);
-      // then output to final output stream
-      // adding syntax highlighting:
-      std::string s(t.str());
-      std::ostream_iterator<char, char> out(os);
-      boost::regex_merge(out, s.begin(), s.end(), e1, format_string);
-      os << footer_text;
-   }
-   }
-   catch(...)
-   { return -1; }
-   return 0;
-}
-
-extern const char* pre_expression = "(<)|(>)|\\r";
-extern const char* pre_format = "(?1&lt;)(?2&gt;)";
-
-
-const char* expression_text = // preprocessor directives: index 1
-                              "(^[[:blank:]]*#(?:[^\\\\\\n]|\\\\[^\\n[:punct:][:word:]]*[\\n[:punct:][:word:]])*)|"
-                              // comment: index 2
-                              "(//[^\\n]*|/\\*.*?\\*/)|"
-                              // literals: index 3
-                              "\\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\>|"
-                              // string literals: index 4
-                              "('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|"
-                              // keywords: index 5
-                              "\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import"
-                              "|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall"
-                              "|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool"
-                              "|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete"
-                              "|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto"
-                              "|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected"
-                              "|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast"
-                              "|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned"
-                              "|using|virtual|void|volatile|wchar_t|while)\\>"
-                              ;
-
-const char* format_string = "(?1<font color=\"#008040\">$&</font>)"
-                            "(?2<I><font color=\"#000080\">$&</font></I>)"
-                            "(?3<font color=\"#0000A0\">$&</font>)"
-                            "(?4<font color=\"#0000FF\">$&</font>)"
-                            "(?5<B>$&</B>)";
-
-const char* header_text = "<HTML>\n<HEAD>\n"
-                          "<TITLE>Auto-generated html formated source</TITLE>\n"
-                          "<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=windows-1252\">\n"
-                          "</HEAD>\n"
-                          "<BODY LINK=\"#0000ff\" VLINK=\"#800080\" BGCOLOR=\"#ffffff\">\n"
-                          "<P> </P>\n<PRE>";
-
-const char* footer_text = "</PRE>\n</BODY>\n\n";
- -
- -

Algorithm regex_split

- -

#include <boost/regex.hpp>

- -

Algorithm regex_split performs a similar operation to the perl split operation, and comes in three overloaded forms:

- -
template <class OutputIterator, class charT, class Traits1, class Alloc1, class Traits2, class Alloc2>
-std::size_t regex_split(OutputIterator out, 
-                        std::basic_string<charT, Traits1, Alloc1>& s, 
-                        const reg_expression<charT, Traits2, Alloc2>& e,
-                        unsigned flags,
-                        std::size_t max_split);
-
-template <class OutputIterator, class charT, class Traits1, class Alloc1, class Traits2, class Alloc2>
-std::size_t regex_split(OutputIterator out, 
-                        std::basic_string<charT, Traits1, Alloc1>& s, 
-                        const reg_expression<charT, Traits2, Alloc2>& e,
-                        unsigned flags = match_default);
-
-template <class OutputIterator, class charT, class Traits1, class Alloc1>
-std::size_t regex_split(OutputIterator out, 
-                        std::basic_string<charT, Traits1, Alloc1>& s);
- -

Each version takes an output-iterator for output, and a string for input. If the expression contains no marked sub-expressions, then the algorithm writes one string onto the output-iterator for each section of input that does not match the expression. If the expression does contain marked sub-expressions, then each time a match is found, one string for each marked sub-expression will be written to the output-iterator. No more than max_split strings will be written to the output-iterator. Before returning, all the input processed will be deleted from the string s (if max_split is not reached then all of s will be deleted). Returns the number of strings written to the output-iterator. If the parameter max_split is not specified then it defaults to UINT_MAX. If no expression is specified, then it defaults to "\s+", and splitting occurs on whitespace.
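The contract described above (fields written to an output iterator, processed input erased from s, a cap of max_split fields) can be sketched as follows. This is a simplified illustration under stated assumptions, not the regex++ implementation: it uses std::regex, handles only the whitespace separator with no marked sub-expressions, and chooses to skip an empty leading field. The name split_whitespace is hypothetical.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <regex>
#include <string>

// Hypothetical helper: writes up to max_split whitespace-separated fields of s
// to out, erases the processed input from s, and returns the number written.
template <class OutputIterator>
std::size_t split_whitespace(OutputIterator out, std::string& s,
                             std::size_t max_split = static_cast<std::size_t>(-1))
{
   static const std::regex separator("\\s+");
   std::size_t written = 0;
   std::smatch what;
   while (written < max_split && std::regex_search(s, what, separator))
   {
      if (what.position(0) > 0)
      {
         // Text before the separator is one field.
         *out++ = what.prefix().str();
         ++written;
      }
      // Drop the field and the separator that followed it.
      s.erase(0, static_cast<std::size_t>(what.position(0) + what.length(0)));
   }
   if (written < max_split && !s.empty())
   {
      // No separator left: the remainder is the final field.
      *out++ = s;
      ++written;
      s.erase();
   }
   return written;
}
```

Called with max_split of 2 on "a b c d", the sketch writes "a" and "b" and leaves "c d" in the string, so a caller can resume splitting later.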

- -

Example: the following function will split the input string into a series of tokens, and remove each token from the string s:

- -
unsigned tokenise(std::list<std::string>& l, std::string& s)
-{
-   return boost::regex_split(std::back_inserter(l), s);
-}
- -

Example: the following short program will extract all of the URLs from an HTML file, and print them out to cout:

- -
#include <list>
-#include <string>
-#include <iterator>
-#include <fstream>
-#include <iostream>
-#include <boost/regex.hpp>
-
-boost::regex e("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"",
-               boost::regbase::normal | boost::regbase::icase);
-
-void load_file(std::string& s, std::istream& is)
-{
-   s.erase();
-   //
-   // attempt to grow string buffer to match file size,
-   // this doesn't always work...
-   s.reserve(is.rdbuf()->in_avail());
-   char c;
-   while(is.get(c))
-   {
-      // use logarithmic growth strategy, in case
-      // in_avail (above) returned zero:
-      if(s.capacity() == s.size())
-         s.reserve(s.capacity() * 3);
-      s.append(1, c);
-   }
-}
-
-
-int main(int argc, char** argv)
-{
-   std::string s;
-   std::list<std::string> l;
-
-   for(int i = 1; i < argc; ++i)
-   {
-      std::cout << "Finding URLs in " << argv[i] << ":" << std::endl;
-      s.erase();
-      std::ifstream is(argv[i]);
-      load_file(s, is);
-      boost::regex_split(std::back_inserter(l), s, e);
-      while(l.size())
-      {
-         s = *(l.begin());
-         l.pop_front();
-         std::cout << s << std::endl;
-      }
-   }
-   return 0;
-}
- -
- -

Partial Matches

- -

The match-flag match_partial can be passed to the following algorithms: regex_match, regex_search, and regex_grep. When used it indicates that partial as well as full matches should be found. A partial match is one that matched one or more characters at the end of the text input, but did not match all of the regular expression (although it may have done so had more input been available). Partial matches are typically used when either validating data input (checking each character as it is entered on the keyboard), or when searching texts that are either too long to load into memory (or even into a memory mapped file), or are of indeterminate length (for example the source may be a socket or similar). Partial and full matches can be differentiated as shown in the following table (the variable M represents an instance of match_results<> as filled in by regex_match, regex_search or regex_grep):
-

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
              | Result | M[0].matched | M[0].first              | M[0].second
No match      | False  | Undefined    | Undefined               | Undefined
Partial match | True   | False        | Start of partial match. | End of partial match (end of text).
Full match    | True   | True         | Start of full match.    | End of full match.
- -

The following example tests whether the text could be a valid credit card number. As the user presses a key, the character entered is added to the string being built up and passed to is_possible_card_number. If this returns true, the text is a complete match for a card number, so the user interface's OK button would be enabled. If it returns false, this is not yet a valid card number but could become one with more input, so the user interface would disable the OK button. Finally, if the procedure throws an exception, the input could never become a valid number; the inputted character must be discarded and a suitable error indication displayed to the user.

- -
#include <string>
-#include <iostream>
-#include <stdexcept>
-#include <boost/regex.hpp>
-
-boost::regex e("(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})");
-
-bool is_possible_card_number(const std::string& input)
-{
-   //
-   // return false for partial match, true for full match, or throw for
-   // impossible match based on what we have so far...
-   boost::match_results<std::string::const_iterator> what;
-   if(0 == boost::regex_match(input, what, e, boost::match_default | boost::match_partial))
-   {
-      // the input so far could not possibly be valid so reject it:
-      throw std::runtime_error("Invalid data entered - this could not possibly be a valid card number");
-   }
-   // OK so far so good, but have we finished?
-   if(what[0].matched)
-   {
-      // excellent, we have a result:
-      return true;
-   }
-   // what we have so far is only a partial match...
-   return false;
-}
- -

In the following example, text input is taken from a stream containing an unknown amount of text; this example simply counts the number of html tags encountered in the stream. The text is loaded into a buffer and searched a part at a time. If a partial match is encountered, the partial match gets searched a second time as the start of the next batch of text:

- -
#include <iostream>
-#include <fstream>
-#include <sstream>
-#include <string>
-#include <cstring>
-#include <boost/regex.hpp>
-
-// match some kind of html tag:
-boost::regex e("<[^>]*>");
-// count how many:
-unsigned int tags = 0;
-// saved position of partial match:
-char* next_pos = 0;
-
-bool grep_callback(const boost::match_results<char*>& m)
-{
-   if(m[0].matched == false)
-   {
-      // save position and return:
-      next_pos = m[0].first;
-   }
-   else
-      ++tags;
-   return true;
-}
-
-void search(std::istream& is)
-{
-   char buf[4096];
-   next_pos = buf + sizeof(buf);
-   bool have_more = true;
-   while(have_more)
-   {
-      // how much do we copy forward from last try:
-      unsigned leftover = (buf + sizeof(buf)) - next_pos;
-      // and how much is left to fill:
-      unsigned size = next_pos - buf;
-      // copy forward whatever we have left:
-      // (memmove, because the source and destination ranges may overlap)
-      memmove(buf, next_pos, leftover);
-      // fill the rest from the stream:
-      unsigned read = is.readsome(buf + leftover, size);
-      // check to see if we've run out of text:
-      have_more = read == size;
-      // reset next_pos:
-      next_pos = buf + sizeof(buf);
-      // and then grep:
-      boost::regex_grep(grep_callback,
-                        buf,
-                        buf + read + leftover,
-                        e,
-                        boost::match_default | boost::match_partial);
-   }
-}
- -
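The carry-forward idea in the example above can be shown without a fixed buffer. The sketch below is an emulation under stated assumptions rather than a use of match_partial: std::regex has no partial-match flag, so for this one pattern the unfinished tail is found by hand as the first '<' after the last complete tag. The name count_tags and the chunked-input interface are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: counts complete tags in text that arrives in chunks.
std::size_t count_tags(const std::vector<std::string>& chunks)
{
   static const std::regex tag("<[^>]*>");
   std::size_t tags = 0;
   std::string carry;  // unfinished tag left over from the previous chunk
   for (const std::string& chunk : chunks)
   {
      const std::string text = carry + chunk;
      std::size_t consumed = 0;  // offset just past the last complete tag
      for (std::sregex_iterator it(text.begin(), text.end(), tag), end;
           it != end; ++it)
      {
         ++tags;
         consumed = static_cast<std::size_t>(it->position(0) + it->length(0));
      }
      // A '<' after the last complete tag has no closing '>' yet, so it is
      // the start of a partial match and must be searched again.
      const std::size_t open = text.find('<', consumed);
      carry = (open == std::string::npos) ? std::string() : text.substr(open);
   }
   return tags;
}
```

As in the buffer-based version, a tag split across two chunks is counted once, when the chunk that completes it arrives.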
- -

Copyright Dr John Maddock 1998-2001 all rights reserved.

- - diff --git a/test/pathology/bad_expression_test.cpp b/test/pathology/bad_expression_test.cpp new file mode 100644 index 00000000..8a929941 --- /dev/null +++ b/test/pathology/bad_expression_test.cpp @@ -0,0 +1,52 @@ +/* + * + * Copyright (c) 1998-2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + + /* + * LOCATION: see http://www.boost.org for most recent version. + * FILE: recursion_test.cpp + * VERSION: see + * DESCRIPTION: Test for indefinite recursion and/or stack overrun. + */ + +#include +#include +#include + +int test_main( int argc, char* argv[] ) +{ + std::string bad_text(1024, ' '); + std::string good_text(200, ' '); + good_text.append("xyz"); + + boost::smatch what; + + boost::regex e1("(.+)+xyz"); + + BOOST_CHECK(boost::regex_search(good_text, what, e1)); + BOOST_CHECK_THROW(boost::regex_search(bad_text, what, e1), std::runtime_error); + BOOST_CHECK(boost::regex_search(good_text, what, e1)); + + BOOST_CHECK(boost::regex_match(good_text, what, e1)); + BOOST_CHECK_THROW(boost::regex_match(bad_text, what, e1), std::runtime_error); + BOOST_CHECK(boost::regex_match(good_text, what, e1)); + + boost::regex e2("abc|[[:space:]]+(xyz)?[[:space:]]+xyz"); + + BOOST_CHECK(boost::regex_search(good_text, what, e2)); + BOOST_CHECK_THROW(boost::regex_search(bad_text, what, e2), std::runtime_error); + BOOST_CHECK(boost::regex_search(good_text, what, e2)); + + return 0; +} diff --git a/test/pathology/recursion_test.cpp b/test/pathology/recursion_test.cpp new file mode 100644 index 00000000..1a67eee1 --- 
/dev/null +++ b/test/pathology/recursion_test.cpp @@ -0,0 +1,63 @@ +/* + * + * Copyright (c) 1998-2002 + * Dr John Maddock + * + * Permission to use, copy, modify, distribute and sell this software + * and its documentation for any purpose is hereby granted without fee, + * provided that the above copyright notice appear in all copies and + * that both that copyright notice and this permission notice appear + * in supporting documentation. Dr John Maddock makes no representations + * about the suitability of this software for any purpose. + * It is provided "as is" without express or implied warranty. + * + */ + + /* + * LOCATION: see http://www.boost.org for most recent version. + * FILE: recursion_test.cpp + * VERSION: see + * DESCRIPTION: Test for indefinite recursion and/or stack overrun. + */ + +#include +#include +#include + +int test_main( int argc, char* argv[] ) +{ + // this regex will recurse twice for each whitespace character matched: + boost::regex e("([[:space:]]|.)+"); + + std::string bad_text(1024*1024*4, ' '); + std::string good_text(200, ' '); + + boost::smatch what; + + // + // Over and over: We want to make sure that after a stack error has + // been triggered, that we can still conduct a good search and that + // subsequent stack failures still do the right thing: + // + BOOST_CHECK(boost::regex_search(good_text, what, e)); + BOOST_CHECK_THROW(boost::regex_search(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_search(good_text, what, e)); + BOOST_CHECK_THROW(boost::regex_search(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_search(good_text, what, e)); + BOOST_CHECK_THROW(boost::regex_search(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_search(good_text, what, e)); + BOOST_CHECK_THROW(boost::regex_search(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_search(good_text, what, e)); + + BOOST_CHECK(boost::regex_match(good_text, what, e)); + 
BOOST_CHECK_THROW(boost::regex_match(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_match(good_text, what, e)); + BOOST_CHECK_THROW(boost::regex_match(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_match(good_text, what, e)); + BOOST_CHECK_THROW(boost::regex_match(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_match(good_text, what, e)); + BOOST_CHECK_THROW(boost::regex_match(bad_text, what, e), std::runtime_error); + BOOST_CHECK(boost::regex_match(good_text, what, e)); + + return 0; +} \ No newline at end of file diff --git a/test/regress/v3_tests.txt b/test/regress/v3_tests.txt new file mode 100644 index 00000000..5ad00e7f --- /dev/null +++ b/test/regress/v3_tests.txt @@ -0,0 +1,908 @@ +; +; +; this file contains a script of tests to run through regress.exe +; +; comments start with a semicolon and proceed to the end of the line +; +; changes to regular expression compile flags start with a "-" as the first +; non-whitespace character and consist of a list of the printable names +; of the flags, for example "match_default" +; +; Other lines contain a test to perform using the current flag status +; the first token contains the expression to compile, the second the string +; to match it against. If the second string is "!" then the expression should +; not compile, that is the first string is an invalid regular expression. +; This is then followed by a list of integers that specify what should match, +; each pair represents the starting and ending positions of a subexpression +; starting with the zeroth subexpression (the whole match). +; A value of -1 indicates that the subexpression should not take part in the +; match at all, if the first value is -1 then no part of the expression should +; match the string. 
+; + +- match_default normal REG_EXTENDED + +; +; try some really simple literals: +a a 0 1 +Z Z 0 1 +Z aaa -1 -1 +Z xxxxZZxxx 4 5 + +; and some simple brackets: +(a) zzzaazz 3 4 3 4 +() zzz 0 0 0 0 +() "" 0 0 0 0 +( ! +) ! +(aa ! +aa) ! +a b -1 -1 +\(\) () 0 2 +\(a\) (a) 0 3 +\() ! +(\) ! +p(a)rameter ABCparameterXYZ 3 12 4 5 +[pq](a)rameter ABCparameterXYZ 3 12 4 5 + +; now try escaped brackets: +- match_default bk_parens REG_BASIC +\(a\) zzzaazz 3 4 3 4 +\(\) zzz 0 0 0 0 +\(\) "" 0 0 0 0 +\( ! +\) ! +\(aa ! +aa\) ! +() () 0 2 +(a) (a) 0 3 +(\) ! +\() ! + +; now move on to "." wildcards +- match_default normal REG_EXTENDED REG_STARTEND +. a 0 1 +. \n 0 1 +. \r 0 1 +. \0 0 1 +- match_default normal match_not_dot_newline REG_EXTENDED REG_STARTEND REG_NEWLINE +. a 0 1 +. \n -1 -1 +. \r -1 -1 +. \0 0 1 +- match_default normal match_not_dot_null match_not_dot_newline REG_EXTENDED REG_STARTEND REG_NEWLINE +. \n -1 -1 +. \r -1 -1 +; this *WILL* produce an error from the POSIX API functions: +- match_default normal match_not_dot_null match_not_dot_newline REG_EXTENDED REG_STARTEND REG_NEWLINE REG_NO_POSIX_TEST +. \0 -1 -1 + + +; +; now move on to the repetion ops, +; starting with operator * +- match_default normal REG_EXTENDED +a* b 0 0 +ab* a 0 1 +ab* ab 0 2 +ab* sssabbbbbbsss 3 10 +ab*c* a 0 1 +ab*c* abbb 0 4 +ab*c* accc 0 4 +ab*c* abbcc 0 5 +*a ! +\<* ! +\>* ! +\n* \n\n 0 2 +\** ** 0 2 +\* * 0 1 + +; now try operator + +ab+ a -1 -1 +ab+ ab 0 2 +ab+ sssabbbbbbsss 3 10 +ab+c+ a -1 -1 +ab+c+ abbb -1 -1 +ab+c+ accc -1 -1 +ab+c+ abbcc 0 5 ++a ! +\<+ ! +\>+ ! +\n+ \n\n 0 2 +\+ + 0 1 +\+ ++ 0 1 +\++ ++ 0 2 +- match_default normal bk_plus_qm REG_EXTENDED REG_NO_POSIX_TEST ++ + 0 1 +\+ ! +a\+ aa 0 2 + +; now try operator ? +- match_default normal REG_EXTENDED +a? b 0 0 +ab? a 0 1 +ab? ab 0 2 +ab? sssabbbbbbsss 3 5 +ab?c? a 0 1 +ab?c? abbb 0 2 +ab?c? accc 0 2 +ab?c? abcc 0 3 +?a ! +\? ! +\n? \n\n 0 1 +\? ? 0 1 +\? ?? 0 1 +\?? ?? 
0 1 +- match_default normal bk_plus_qm REG_EXTENDED REG_NO_POSIX_TEST +? ? 0 1 +\? ! +a\? aa 0 1 +a\? b 0 0 + +- match_default normal limited_ops +a? a? 0 2 +a+ a+ 0 2 +a\? a? 0 2 +a\+ a+ 0 2 + +; now try operator {} +- match_default normal REG_EXTENDED +a{2} a -1 -1 +a{2} aa 0 2 +a{2} aaa 0 2 +a{2,} a -1 -1 +a{2,} aa 0 2 +a{2,} aaaaa 0 5 +a{2,4} a -1 -1 +a{2,4} aa 0 2 +a{2,4} aaa 0 3 +a{2,4} aaaa 0 4 +a{2,4} aaaaa 0 4 +; spaces are now allowed inside {} +"a{ 2 , 4 }" aaaaa 0 4 +a{} ! +"a{ }" ! +a{2 ! +a} ! +\{\} {} 0 2 + +- match_default normal bk_braces +a\{2\} a -1 -1 +a\{2\} aa 0 2 +a\{2\} aaa 0 2 +a\{2,\} a -1 -1 +a\{2,\} aa 0 2 +a\{2,\} aaaaa 0 5 +a\{2,4\} a -1 -1 +a\{2,4\} aa 0 2 +a\{2,4\} aaa 0 3 +a\{2,4\} aaaa 0 4 +a\{2,4\} aaaaa 0 4 +"a\{ 2 , 4 \}" aaaaa 0 4 +{} {} 0 2 + +; now test the alternation operator | +- match_default normal REG_EXTENDED +a|b a 0 1 +a|b b 0 1 +a(b|c) ab 0 2 1 2 +a(b|c) ac 0 2 1 2 +a(b|c) ad -1 -1 -1 -1 +|c ! +c| ! +(|) ! +(a|) ! +(|a) ! +a\| a| 0 2 +- match_default normal limited_ops +a| a| 0 2 +a\| a| 0 2 +| | 0 1 +- match_default normal bk_vbar REG_NO_POSIX_TEST +a| a| 0 2 +a\|b a 0 1 +a\|b b 0 1 + +; now test the set operator [] +- match_default normal REG_EXTENDED +; try some literals first +[abc] a 0 1 +[abc] b 0 1 +[abc] c 0 1 +[abc] d -1 -1 +[^bcd] a 0 1 +[^bcd] b -1 -1 +[^bcd] d -1 -1 +[^bcd] e 0 1 +a[b]c abc 0 3 +a[ab]c abc 0 3 +a[^ab]c adc 0 3 +a[]b]c a]c 0 3 +a[[b]c a[c 0 3 +a[-b]c a-c 0 3 +a[^]b]c adc 0 3 +a[^-b]c adc 0 3 +a[b-]c a-c 0 3 +a[b ! +a[] ! + +; then some ranges +[b-e] a -1 -1 +[b-e] b 0 1 +[b-e] e 0 1 +[b-e] f -1 -1 +[^b-e] a 0 1 +[^b-e] b -1 -1 +[^b-e] e -1 -1 +[^b-e] f 0 1 +a[1-3]c a2c 0 3 +a[3-1]c ! +a[1-3-5]c ! +a[1- ! + +; and some classes +a[[:alpha:]]c abc 0 3 +a[[:unknown:]]c ! +a[[: ! +a[[:alpha ! +a[[:alpha:] ! +a[[:alpha,:] ! +a[[:]:]]b ! +a[[:-:]]b ! +a[[:alph:]] ! +a[[:alphabet:]] ! 
+[[:alnum:]]+ -%@a0X_- 3 6 +[[:alpha:]]+ -%@aX_0- 3 5 +[[:blank:]]+ "a \tb" 1 4 +[[:cntrl:]]+ a\n\tb 1 3 +[[:digit:]]+ a019b 1 4 +[[:graph:]]+ " a%b " 1 4 +[[:lower:]]+ AabC 1 3 +; This test fails with STLPort, disable for now as this is a corner case anyway... +;[[:print:]]+ "\na b\n" 1 4 +[[:punct:]]+ " %-&\t" 1 4 +[[:space:]]+ "a \n\t\rb" 1 5 +[[:upper:]]+ aBCd 1 3 +[[:xdigit:]]+ p0f3Cx 1 5 + +; now test flag settings: +- escape_in_lists REG_NO_POSIX_TEST +[\n] \n 0 1 +- REG_NO_POSIX_TEST +[\n] \n -1 -1 +[\n] \\ 0 1 +[[:class:] : 0 1 +[[:class:] [ 0 1 +[[:class:] c 0 1 + +; line anchors +- match_default normal REG_EXTENDED +^ab ab 0 2 +^ab xxabxx -1 -1 +^ab xx\nabzz 3 5 +ab$ ab 0 2 +ab$ abxx -1 -1 +ab$ ab\nzz 0 2 +- match_default match_not_bol match_not_eol normal REG_EXTENDED REG_NOTBOL REG_NOTEOL +^ab ab -1 -1 +^ab xxabxx -1 -1 +^ab xx\nabzz 3 5 +ab$ ab -1 -1 +ab$ abxx -1 -1 +ab$ ab\nzz 0 2 + +; back references +- match_default normal REG_EXTENDED +a(b)\2c ! +a(b\1)c ! +a(b*)c\1d abbcbbd 0 7 1 3 +a(b*)c\1d abbcbd -1 -1 +a(b*)c\1d abbcbbbd -1 -1 +^(.)\1 abc -1 -1 +a([bc])\1d abcdabbd 4 8 5 6 +; strictly speaking this is at best ambiguous, at worst wrong, this is what most +; re implimentations will match though. 
+a(([bc])\2)*d abbccd 0 6 3 5 3 4 + +a(([bc])\2)*d abbcbd -1 -1 +a((b)*\2)*d abbbd 0 5 1 4 2 3 +(ab*)[ab]*\1 ababaaa 0 7 0 1 +(a)\1bcd aabcd 0 5 0 1 +(a)\1bc*d aabcd 0 5 0 1 +(a)\1bc*d aabd 0 4 0 1 +(a)\1bc*d aabcccd 0 7 0 1 +(a)\1bc*[ce]d aabcccd 0 7 0 1 +^(a)\1b(c)*cd$ aabcccd 0 7 0 1 4 5 + +; +; characters by code: +- match_default normal REG_EXTENDED REG_STARTEND +\0101 A 0 1 +\00 \0 0 1 +\0 \0 0 1 +\0172 z 0 1 + +; +; word operators: +\w a 0 1 +\w z 0 1 +\w A 0 1 +\w Z 0 1 +\w _ 0 1 +\w } -1 -1 +\w ` -1 -1 +\w [ -1 -1 +\w @ -1 -1 +; non-word: +\W a -1 -1 +\W z -1 -1 +\W A -1 -1 +\W Z -1 -1 +\W _ -1 -1 +\W } 0 1 +\W ` 0 1 +\W [ 0 1 +\W @ 0 1 +; word start: +\ abc 0 3 +abc\> abcd -1 -1 +abc\> abc\n 0 3 +abc\> abc:: 0 3 +; word boundary: +\babcd " abcd" 2 6 +\bab cab -1 -1 +\bab "\nab" 1 3 +\btag ::tag 2 5 +abc\b abc 0 3 +abc\b abcd -1 -1 +abc\b abc\n 0 3 +abc\b abc:: 0 3 +; within word: +\B ab 1 1 +a\Bb ab 0 2 +a\B ab 0 1 +a\B a -1 -1 +a\B "a " -1 -1 + +; +; buffer operators: +\`abc abc 0 3 +\`abc \nabc -1 -1 +\`abc " abc" -1 -1 +abc\' abc 0 3 +abc\' abc\n -1 -1 +abc\' "abc " -1 -1 + +; +; extra escape sequences: +\a \a 0 1 +\f \f 0 1 +\n \n 0 1 +\r \r 0 1 +\t \t 0 1 +\v \v 0 1 + + +; +; now follows various complex expressions designed to try and bust the matcher: +a(((b)))c abc 0 3 1 2 1 2 1 2 +a(b|(c))d abd 0 3 1 2 -1 -1 +a(b|(c))d acd 0 3 1 2 1 2 +a(b*|c)d abbd 0 4 1 3 +; just gotta have one DFA-buster, of course +a[ab]{20} aaaaabaaaabaaaabaaaab 0 21 +; and an inline expansion in case somebody gets tricky +a[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab] aaaaabaaaabaaaabaaaab 0 21 +; and in case somebody just slips in an NFA... 
+a[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab](wee|week)(knights|night) aaaaabaaaabaaaabaaaabweeknights 0 31 21 24 24 31 +; one really big one +1234567890123456789012345678901234567890123456789012345678901234567890 a1234567890123456789012345678901234567890123456789012345678901234567890b 1 71 +; fish for problems as brackets go past 8 +[ab][cd][ef][gh][ij][kl][mn] xacegikmoq 1 8 +[ab][cd][ef][gh][ij][kl][mn][op] xacegikmoq 1 9 +[ab][cd][ef][gh][ij][kl][mn][op][qr] xacegikmoqy 1 10 +[ab][cd][ef][gh][ij][kl][mn][op][q] xacegikmoqy 1 10 +; and as parenthesis go past 9: +(a)(b)(c)(d)(e)(f)(g)(h) zabcdefghi 1 9 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 +(a)(b)(c)(d)(e)(f)(g)(h)(i) zabcdefghij 1 10 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 +(a)(b)(c)(d)(e)(f)(g)(h)(i)(j) zabcdefghijk 1 11 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 +(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k) zabcdefghijkl 1 12 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 +(a)d|(b)c abc 1 3 -1 -1 1 2 +"_+((www)|(ftp)|(mailto)):_*" "_wwwnocolon _mailto:" 12 20 13 19 -1 -1 -1 -1 13 19 + +; subtleties of matching +a(b)?c\1d acd 0 3 -1 -1 +a(b?c)+d accd 0 4 2 3 +(wee|week)(knights|night) weeknights 0 10 0 3 3 10 +.* abc 0 3 +a(b|(c))d abd 0 3 1 2 -1 -1 +a(b|(c))d acd 0 3 1 2 1 2 +a(b*|c|e)d abbd 0 4 1 3 +a(b*|c|e)d acd 0 3 1 2 +a(b*|c|e)d ad 0 2 1 1 +a(b?)c abc 0 3 1 2 +a(b?)c ac 0 2 1 1 +a(b+)c abc 0 3 1 2 +a(b+)c abbbc 0 5 1 4 +a(b*)c ac 0 2 1 1 +(a|ab)(bc([de]+)f|cde) abcdef 0 6 0 1 1 6 3 5 +a([bc]?)c abc 0 3 1 2 +a([bc]?)c ac 0 2 1 1 +a([bc]+)c abc 0 3 1 2 +a([bc]+)c abcc 0 4 1 3 +a([bc]+)bc abcbc 0 5 1 3 +a(bb+|b)b abb 0 3 1 2 +a(bbb+|bb+|b)b abb 0 3 1 2 +a(bbb+|bb+|b)b abbb 0 4 1 3 +a(bbb+|bb+|b)bb abbb 0 4 1 2 +(.*).* abcdef 0 6 0 6 +(a*)* bc 0 0 0 0 + +; do we get the right subexpression when it is used more than once? 
+a(b|c)*d ad 0 2 -1 -1 +a(b|c)*d abcd 0 4 2 3 +a(b|c)+d abd 0 3 1 2 +a(b|c)+d abcd 0 4 2 3 +a(b|c?)+d ad 0 2 1 1 +a(b|c?)+d abcd 0 4 2 3 +a(b|c){0,0}d ad 0 2 -1 -1 +a(b|c){0,1}d ad 0 2 -1 -1 +a(b|c){0,1}d abd 0 3 1 2 +a(b|c){0,2}d ad 0 2 -1 -1 +a(b|c){0,2}d abcd 0 4 2 3 +a(b|c){0,}d ad 0 2 -1 -1 +a(b|c){0,}d abcd 0 4 2 3 +a(b|c){1,1}d abd 0 3 1 2 +a(b|c){1,2}d abd 0 3 1 2 +a(b|c){1,2}d abcd 0 4 2 3 +a(b|c){1,}d abd 0 3 1 2 +a(b|c){1,}d abcd 0 4 2 3 +a(b|c){2,2}d acbd 0 4 2 3 +a(b|c){2,2}d abcd 0 4 2 3 +a(b|c){2,4}d abcd 0 4 2 3 +a(b|c){2,4}d abcbd 0 5 3 4 +a(b|c){2,4}d abcbcd 0 6 4 5 +a(b|c){2,}d abcd 0 4 2 3 +a(b|c){2,}d abcbd 0 5 3 4 +a(b+|((c)*))+d abd 0 3 1 2 -1 -1 -1 -1 +a(b+|((c)*))+d abcd 0 4 2 3 2 3 2 3 + +- match_default normal REG_EXTENDED REG_STARTEND REG_NOSPEC literal +\**?/{} \\**?/{} 0 7 + +- match_default normal REG_EXTENDED REG_NO_POSIX_TEST ; we disable POSIX testing because it can't handle escapes in sets +; try to match C++ syntax elements: +; line comment: +//[^\n]* "++i //here is a line comment\n" 4 28 +; block comment: +/\*([^*]|\*+[^*/])*\*+/ "/* here is a block comment */" 0 29 26 27 +/\*([^*]|\*+[^*/])*\*+/ "/**/" 0 4 -1 -1 +/\*([^*]|\*+[^*/])*\*+/ "/***/" 0 5 -1 -1 +/\*([^*]|\*+[^*/])*\*+/ "/****/" 0 6 -1 -1 +/\*([^*]|\*+[^*/])*\*+/ "/*****/" 0 7 -1 -1 +/\*([^*]|\*+[^*/])*\*+/ "/*****/*/" 0 7 -1 -1 +; preprossor directives: +^[[:blank:]]*#([^\n]*\\[[:space:]]+)*[^\n]* "#define some_symbol" 0 19 -1 -1 +^[[:blank:]]*#([^\n]*\\[[:space:]]+)*[^\n]* "#define some_symbol(x) #x" 0 25 -1 -1 +^[[:blank:]]*#([^\n]*\\[[:space:]]+)*[^\n]* "#define some_symbol(x) \\ \r\n foo();\\\r\n printf(#x);" 0 53 28 42 +; literals: +((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 0xFF 0 4 0 4 0 4 -1 -1 -1 -1 -1 -1 -1 -1 +((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 35 0 2 0 2 -1 -1 0 2 -1 -1 -1 -1 -1 -1 +((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 
0xFFu 0 5 0 4 0 4 -1 -1 -1 -1 -1 -1 -1 -1 +((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 0xFFL 0 5 0 4 0 4 -1 -1 4 5 -1 -1 -1 -1 +((0x[[:xdigit:]]+)|([[:digit:]]+))u?((int(8|16|32|64))|L)? 0xFFFFFFFFFFFFFFFFuint64 0 24 0 18 0 18 -1 -1 19 24 19 24 22 24 +; strings: +'([^\\']|\\.)*' '\\x3A' 0 6 4 5 +'([^\\']|\\.)*' '\\'' 0 4 1 3 +'([^\\']|\\.)*' '\\n' 0 4 1 3 + +; now try and test some unicode specific characters: +- match_default normal REG_PERL REG_UNICODE_ONLY +[[:unicode:]]+ a\0300\0400z 1 3 +[\x10-\xff] \39135\12409 -1 -1 +[\01-\05]{5} \36865\36865\36865\36865\36865 -1 -1 + +; finally try some case insensitive matches: +- match_default normal REG_EXTENDED REG_ICASE +; upper and lower have no meaning here so they fail, however these +; may compile with other libraries... +;[[:lower:]] ! +;[[:upper:]] ! +0123456789@abcdefghijklmnopqrstuvwxyz\[\\\]\^_`ABCDEFGHIJKLMNOPQRSTUVWXYZ\{\|\} 0123456789@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]\^_`abcdefghijklmnopqrstuvwxyz\{\|\} 0 72 + +; known and suspected bugs: +- match_default normal REG_EXTENDED +\( ( 0 1 +\) ) 0 1 +\$ $ 0 1 +\^ ^ 0 1 +\. . 0 1 +\* * 0 1 +\+ + 0 1 +\? ? 
0 1 +\[ [ 0 1 +\] ] 0 1 +\| | 0 1 +\\ \\ 0 1 +# # 0 1 +\# # 0 1 +a- a- 0 2 +\- - 0 1 +\{ { 0 1 +\} } 0 1 +0 0 0 1 +1 1 0 1 +9 9 0 1 +b b 0 1 +B B 0 1 +< < 0 1 +> > 0 1 +w w 0 1 +W W 0 1 +` ` 0 1 +' ' 0 1 +\n \n 0 1 +, , 0 1 +a a 0 1 +f f 0 1 +n n 0 1 +r r 0 1 +t t 0 1 +v v 0 1 +c c 0 1 +x x 0 1 +: : 0 1 +(\.[[:alnum:]]+){2} "w.a.b " 1 5 3 5 + +- match_default normal REG_EXTENDED REG_ICASE +a A 0 1 +A a 0 1 +[abc]+ abcABC 0 6 +[ABC]+ abcABC 0 6 +[a-z]+ abcABC 0 6 +[A-Z]+ abzANZ 0 6 +[a-Z]+ abzABZ 0 6 +[A-z]+ abzABZ 0 6 +[[:lower:]]+ abyzABYZ 0 8 +[[:upper:]]+ abzABZ 0 6 +[[:word:]]+ abcZZZ 0 6 +[[:alpha:]]+ abyzABYZ 0 8 +[[:alnum:]]+ 09abyzABYZ 0 10 + +; updated tests for version 2: +- match_default normal REG_EXTENDED +\x41 A 0 1 +\xff \255 0 1 +\xFF \255 0 1 +- match_default normal REG_EXTENDED REG_NO_POSIX_TEST +\c@ \0 0 1 +- match_default normal REG_EXTENDED +\cA \1 0 1 +\cz \58 0 1 +\c= ! +\c? ! +=: =: 0 2 + +; word start: +[[:<:]]abcd " abcd" 2 6 +[[:<:]]ab cab -1 -1 +[[:<:]]ab "\nab" 1 3 +[[:<:]]tag ::tag 2 5 +;word end: +abc[[:>:]] abc 0 3 +abc[[:>:]] abcd -1 -1 +abc[[:>:]] abc\n 0 3 +abc[[:>:]] abc:: 0 3 + +; collating elements and rewritten set code: +- match_default normal REG_EXTENDED REG_STARTEND +[[.zero.]] 0 0 1 +[[.one.]] 1 0 1 +[[.two.]] 2 0 1 +[[.three.]] 3 0 1 +[[.a.]] baa 1 2 +[[.right-curly-bracket.]] } 0 1 +[[.NUL.]] \0 0 1 +[[:<:]z] ! +[a[:>:]] ! +[[=a=]] a 0 1 +[[=right-curly-bracket=]] } 0 1 +- match_default normal REG_EXTENDED REG_STARTEND REG_ICASE +[[.A.]] A 0 1 +[[.A.]] a 0 1 +[[.A.]-b]+ AaBb 0 4 +[A-[.b.]]+ AaBb 0 4 +[[.a.]-B]+ AaBb 0 4 +[a-[.B.]]+ AaBb 0 4 +- match_default normal REG_EXTENDED REG_NO_POSIX_TEST +[\x61] a 0 1 +[\x61-c]+ abcd 0 3 +[a-\x63]+ abcd 0 3 +- match_default normal REG_EXTENDED REG_STARTEND +[[.a.]-c]+ abcd 0 3 +[a-[.c.]]+ abcd 0 3 +[[:alpha:]-a] ! +[a-[:alpha:]] ! 
+ +; try mutli-character ligatures: +[[.ae.]] ae 0 2 +[[.ae.]] aE -1 -1 +[[.AE.]] AE 0 2 +[[.Ae.]] Ae 0 2 +[[.ae.]-b] a -1 -1 +[[.ae.]-b] b 0 1 +[[.ae.]-b] ae 0 2 +[a-[.ae.]] a 0 1 +[a-[.ae.]] b -1 -1 +[a-[.ae.]] ae 0 2 +- match_default normal REG_EXTENDED REG_STARTEND REG_ICASE +[[.ae.]] AE 0 2 +[[.ae.]] Ae 0 2 +[[.AE.]] Ae 0 2 +[[.Ae.]] aE 0 2 +[[.AE.]-B] a -1 -1 +[[.Ae.]-b] b 0 1 +[[.Ae.]-b] B 0 1 +[[.ae.]-b] AE 0 2 + +- match_default normal REG_EXTENDED REG_STARTEND +;extended perl style escape sequences: +\e \27 0 1 +\x1b \27 0 1 +\x{1b} \27 0 1 +\x{} ! +\x{ ! +\x} ! +\x ! +\x{yy ! +\x{1b ! + +- match_default normal REG_EXTENDED REG_STARTEND REG_NO_POSIX_TEST +\l+ ABabcAB 2 5 +[\l]+ ABabcAB 2 5 +[a-\l] ! +[\l-a] ! +[\L] ! +\L+ abABCab 2 5 +\u+ abABCab 2 5 +[\u]+ abABCab 2 5 +[\U] ! +\U+ ABabcAB 2 5 +\d+ ab012ab 2 5 +[\d]+ ab012ab 2 5 +[\D] ! +\D+ 01abc01 2 5 +\s+ "ab ab" 2 5 +[\s]+ "ab ab" 2 5 +[\S] ! +\S+ " abc " 2 5 +- match_default normal REG_EXTENDED REG_STARTEND +\Qabc ! +\Qabc\E abcd 0 3 +\Qabc\Ed abcde 0 4 +\Q+*?\\E +*?\\ 0 4 + +\C+ abcde 0 5 +\X+ abcde 0 5 + +- match_default normal REG_EXTENDED REG_STARTEND REG_UNICODE_ONLY +\X+ a\768\769 0 3 +\X+ \2309\2307 0 2 ;DEVANAGARI script +\X+ \2489\2494 0 2 ;BENGALI script + +- match_default normal REG_EXTENDED REG_STARTEND +\Aabc abc 0 3 +\Aabc aabc -1 -1 +abc\z abc 0 3 +abc\z abcd -1 -1 +abc\Z abc\n\n 0 3 +abc\Z abc 0 3 + + +\Gabc abc 0 3 +\Gabc dabcd -1 -1 +a\Gbc abc -1 -1 +a\Aab abc -1 -1 + +; +; now test grep, +; basically check all our restart types - line, word, etc +; checking each one for null and non-null matches. 
+; +- match_default normal REG_EXTENDED REG_STARTEND REG_GREP +a " a a a aa" 1 2 3 4 5 6 7 8 8 9 +a+b+ "aabaabbb ab" 0 3 3 8 9 11 +a(b*|c|e)d adabbdacd 0 2 2 6 6 9 +a "\na\na\na\naa" 1 2 3 4 5 6 7 8 8 9 + +^ " \n\n \n\n\n" 0 0 4 4 5 5 8 8 9 9 10 10 +^ab "ab \nab ab\n" 0 2 5 7 +^[^\n]*\n " \n \n\n \n" 0 4 4 7 7 8 8 11 +\ <123><><><> +[[:digit:]]* 123ab1 <$0> <123><><><1> + +; and now escapes: +a+ "...aaa,,," $x "$x" +a+ "...aaa,,," \a "\a" +a+ "...aaa,,," \f "\f" +a+ "...aaa,,," \n "\n" +a+ "...aaa,,," \r "\r" +a+ "...aaa,,," \t "\t" +a+ "...aaa,,," \v "\v" + +a+ "...aaa,,," \x21 "!" +a+ "...aaa,,," \x{21} "!" +a+ "...aaa,,," \c@ \0 +a+ "...aaa,,," \e \27 +a+ "...aaa,,," \0101 A +a+ "...aaa,,," (\0101) A + +- match_default normal REG_EXTENDED REG_STARTEND REG_MERGE format_sed format_no_copy +(a+)(b+) ...aabb,, \0 aabb +(a+)(b+) ...aabb,, \1 aa +(a+)(b+) ...aabb,, \2 bb +(a+)(b+) ...aabb,, & aabb +(a+)(b+) ...aabb,, $ $ +(a+)(b+) ...aabb,, $1 $1 +(a+)(b+) ...aabb,, ()?: ()?: +(a+)(b+) ...aabb,, \\ \\ +(a+)(b+) ...aabb,, \& & + + +- match_default normal REG_EXTENDED REG_STARTEND REG_MERGE format_perl format_no_copy +(a+)(b+) ...aabb,, $0 aabb +(a+)(b+) ...aabb,, $1 aa +(a+)(b+) ...aabb,, $2 bb +(a+)(b+) ...aabb,, $& aabb +(a+)(b+) ...aabb,, & & +(a+)(b+) ...aabb,, \0 \0 +(a+)(b+) ...aabb,, ()?: ()?: + +- match_default normal REG_EXTENDED REG_STARTEND REG_MERGE +; move to copying unmatched data: +a+ "...aaa,,," bbb "...bbb,,," +a+(b+) "...aaabb,,," $1 "...bb,,," +a+(b+) "...aaabb,,,ab*abbb?" $1 "...bb,,,b*bbb?" + +(a+)|(b+) "...aaabb,,,ab*abbb?" (?1A)(?2B) "...AB,,,AB*AB?" +(a+)|(b+) "...aaabb,,,ab*abbb?" ?1A:B "...AB,,,AB*AB?" +(a+)|(b+) "...aaabb,,,ab*abbb?" (?1A:B)C "...ACBC,,,ACBC*ACBC?" +(a+)|(b+) "...aaabb,,,ab*abbb?" ?1:B "...B,,,B*B?" 
+ +- match_default normal REG_EXTENDED REG_STARTEND REG_MERGE format_first_only +; move to copying unmatched data, but replace first occurance only: +a+ "...aaa,,," bbb "...bbb,,," +a+(b+) "...aaabb,,," $1 "...bb,,," +a+(b+) "...aaabb,,,ab*abbb?" $1 "...bb,,,ab*abbb?" +(a+)|(b+) "...aaabb,,,ab*abbb?" (?1A)(?2B) "...Abb,,,ab*abbb?" + +; +; changes to newline handling with 2.11: +; + +- match_default normal REG_EXTENDED REG_STARTEND REG_GREP + +^. " \n \r\n " 0 1 3 4 7 8 +.$ " \n \r\n " 1 2 4 5 8 9 + +- match_default normal REG_EXTENDED REG_STARTEND REG_GREP REG_UNICODE_ONLY +^. " \8232 \8233 " 0 1 3 4 5 6 +.$ " \8232 \8233 " 1 2 3 4 6 7 + +; +; non-greedy repeats added 21/04/00 +- match_default normal REG_EXTENDED +a** ! +a*? aa 0 0 +a?? aa 0 0 +a++ ! +a+? aa 0 1 +a{1,3}{1} ! +a{1,3}? aaa 0 1 +\w+?w ...ccccccwcccccw 3 10 +\W+\w+?w ...ccccccwcccccw 0 10 +abc|\w+? abd 0 1 +abc|\w+? abcd 0 3 +<\s*tag[^>]*>(.*?)<\s*/tag\s*> " here is some text " 1 29 6 23 +<\s*tag[^>]*>(.*?)<\s*/tag\s*> " < tag attr=\"something\">here is some text< /tag > " 1 49 24 41 + +; +; non-marking parenthesis added 25/04/00 +- match_default normal REG_EXTENDED +(?:abc)+ xxabcabcxx 2 8 +(?:a+)(b+) xaaabbbx 1 7 4 7 +(a+)(?:b+) xaaabbba 1 7 1 4 +(?:(a+)b+) xaaabbba 1 7 1 4 +(?:a+(b+)) xaaabbba 1 7 4 7 +a+(?#b+)b+ xaaabbba 1 7 +(a)(?:b|$) ab 0 2 0 1 +(a)(?:b|$) a 0 1 0 1 + + +; +; try some partial matches: +- match_partial match_default normal REG_EXTENDED REG_NO_POSIX_TEST +(xyz)(.*)abc xyzaaab -1 -1 0 3 3 7 +(xyz)(.*)abc xyz -1 -1 0 3 3 3 +(xyz)(.*)abc xy -1 -1 -1 -1 -1 -1 + +; +; forward lookahead asserts added 21/01/02 +- match_default normal REG_EXTENDED REG_NO_POSIX_TEST +((?:(?!a|b)\w)+)(\w+) " xxxabaxxx " 2 11 2 5 5 11 + +/\*(?:(?!\*/).)*\*/ " /**/ " 2 6 +/\*(?:(?!\*/).)*\*/ " /***/ " 2 7 +/\*(?:(?!\*/).)*\*/ " /********/ " 2 12 +/\*(?:(?!\*/).)*\*/ " /* comment */ " 2 15 + +<\s*a[^>]*>((?:(?!<\s*/\s*a\s*>).)*)<\s*/\s*a\s*> " here " 1 24 16 20 
+<\s*a[^>]*>((?:(?!<\s*/\s*a\s*>).)*)<\s*/\s*a\s*> " here< / a > " 1 28 16 20 + +<\s*a[^>]*>((?:(?!<\s*/\s*a\s*>).)*)(?=<\s*/\s*a\s*>) " here " 1 20 16 20 +<\s*a[^>]*>((?:(?!<\s*/\s*a\s*>).)*)(?=<\s*/\s*a\s*>) " here< / a > " 1 20 16 20 + +; filename matching: +^(?!^(?:PRN|AUX|CLOCK\$|NUL|CON|COM\d|LPT\d|\..*)(?:\..+)?$)[^\x00-\x1f\\?*:\"|/]+$ command.com 0 11 +^(?!^(?:PRN|AUX|CLOCK\$|NUL|CON|COM\d|LPT\d|\..*)(?:\..+)?$)[^\x00-\x1f\\?*:\"|/]+$ PRN -1 -1 +^(?!^(?:PRN|AUX|CLOCK\$|NUL|CON|COM\d|LPT\d|\..*)(?:\..+)?$)[^\x00-\x1f\\?*:\"|/]+$ COM2 -1 -1 + +; password checking: +^(?=.*\d).{4,8}$ abc3 0 4 +^(?=.*\d).{4,8}$ abc3def4 0 8 +^(?=.*\d).{4,8}$ ab2 -1 -1 +^(?=.*\d).{4,8}$ abcdefg -1 -1 +^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$ abc3 -1 -1 +^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$ abC3 0 4 +^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{4,8}$ ABCD3 -1 -1 + + + + + diff --git a/traits_class_ref.htm b/traits_class_ref.htm deleted file mode 100644 index 669f5a87..00000000 --- a/traits_class_ref.htm +++ /dev/null @@ -1,1016 +0,0 @@ - - - - - - - - regex++ traits-class reference - - - - - - - - - -

C++ Boost

-

Regex++, Traits Class - Reference.

-

Copyright (c) 1998-2001

-

Dr John Maddock

-

Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty.

-
- -
- -

This section describes the traits class requirements of the reg_expression template class. These requirements are somewhat complex (sorry), and subject to change as users ask for new features; however, I will try to keep them stable for a while, and ideally the requirements should lessen rather than increase.

- -

The reg_expression traits classes encapsulate both the -properties of a character type, and the properties of the locale -associated with that type. The associated locale may be defined -at run-time (via std::locale), or hard-coded into the traits -class and determined at compile time.

- -

The following example class illustrates the interface required -by a "typical" traits class for use with class -reg_expression:

- -
-class mytraits
-{
-public:
-   typedef implementation_defined char_type;
-   typedef implementation_defined uchar_type;
-   typedef implementation_defined size_type;
-   typedef implementation_defined string_type;
-   typedef implementation_defined locale_type;
-   typedef implementation_defined uint32_t;
-   struct sentry
-   {
-      sentry(const mytraits&);
-      operator void*() { return this; }
-   };
-
-   enum char_syntax_type
-   {
-      syntax_char = 0,
-      syntax_open_bracket = 1,                  // (
-      syntax_close_bracket = 2,                 // )
-      syntax_dollar = 3,                        // $
-      syntax_caret = 4,                         // ^
-      syntax_dot = 5,                           // .
-      syntax_star = 6,                          // *
-      syntax_plus = 7,                          // +
-      syntax_question = 8,                      // ?
-      syntax_open_set = 9,                      // [
-      syntax_close_set = 10,                    // ]
-      syntax_or = 11,                           // |
-      syntax_slash = 12,                        // '\'
-      syntax_hash = 13,                         // #
-      syntax_dash = 14,                         // -
-      syntax_open_brace = 15,                   // {
-      syntax_close_brace = 16,                  // }
-      syntax_digit = 17,                        // 0-9
-      syntax_b = 18,                            // for \b
-      syntax_B = 19,                            // for \B
-      syntax_left_word = 20,                    // for \<
-      syntax_right_word = 21,                   // for \>
-      syntax_w = 22,                            // for \w
-      syntax_W = 23,                            // for \W
-      syntax_start_buffer = 24,                 // for \`
-      syntax_end_buffer = 25,                   // for \'
-      syntax_newline = 26,                      // for newline alt
-      syntax_comma = 27,                        // for {x,y}
-
-      syntax_a = 28,                            // for \a
-      syntax_f = 29,                            // for \f
-      syntax_n = 30,                            // for \n
-      syntax_r = 31,                            // for \r
-      syntax_t = 32,                            // for \t
-      syntax_v = 33,                            // for \v
-      syntax_x = 34,                            // for \xdd
-      syntax_c = 35,                            // for \cx
-      syntax_colon = 36,                        // for [:...:]
-      syntax_equal = 37,                        // for [=...=]
-   
-      // perl ops:
-      syntax_e = 38,                            // for \e
-      syntax_l = 39,                            // for \l
-      syntax_L = 40,                            // for \L
-      syntax_u = 41,                            // for \u
-      syntax_U = 42,                            // for \U
-      syntax_s = 43,                            // for \s
-      syntax_S = 44,                            // for \S
-      syntax_d = 45,                            // for \d
-      syntax_D = 46,                            // for \D
-      syntax_E = 47,                            // for \Q\E
-      syntax_Q = 48,                            // for \Q\E
-      syntax_X = 49,                            // for \X
-      syntax_C = 50,                            // for \C
-      syntax_Z = 51,                            // for \Z
-      syntax_G = 52,                            // for \G
-      syntax_bang = 53,                         // reserved for future use '!'
-      syntax_and = 54,                          // reserved for future use '&'
-   };
-
-   enum{
-      char_class_none = 0,
-      char_class_alpha,
-      char_class_cntrl,
-      char_class_digit,
-      char_class_lower,
-      char_class_punct,
-      char_class_space,
-      char_class_upper,
-      char_class_xdigit,
-      char_class_blank,
-      char_class_unicode,
-      char_class_alnum,
-      char_class_graph,
-      char_class_print,
-      char_class_word
-   };
-
-   static size_t length(const char_type* p);
-   unsigned int syntax_type(size_type c)const;
-   char_type translate(char_type c, bool icase)const;
-   void transform(string_type& out, const string_type& in)const;
-   void transform_primary(string_type& out, const string_type& in)const;
-   bool is_separator(char_type c)const;
-   bool is_combining(char_type)const;
-   bool is_class(char_type c, uint32_t f)const;
-   int toi(char_type c)const;
-   int toi(const char_type*& first, const char_type* last, int radix)const;
-   uint32_t lookup_classname(const char_type* first, const char_type* last)const;
-   bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const;
-   locale_type imbue(locale_type l);
-   locale_type getloc()const;
-   std::string error_string(unsigned id)const;
-
-   mytraits();
-   ~mytraits();
-};
-
- -

The member types required by a traits class are defined as -follows:
-  

  Member - name Description -  
  char_type The - character type encapsulated by this traits class, must be - a POD type, and be convertible to uchar_type.  
  uchar_type - The - unsigned type corresponding to char_type, must be - convertible to size_type.  
  size_type An - unsigned integral type, with at least as much precision - as uchar_type.  
  string_type - A type that offers the same facilities as std::basic_string<char_type>. This is used for collating elements and sort strings; if char_type has no locale-dependent collation (it is not a "character"), then it could be something simpler than std::basic_string.  
  locale_type - A type - that encapsulates the locale used by the traits class, - probably std::locale but could be a platform specific - type, or a dummy type if per-instance locales are not - supported by the traits class.  
  uint32_t An - unsigned integral type with at least 32-bits of - precision, used as a bitmask type for character - classification.  
  sentry A class or - struct type which is constructible from an instance of - the traits class, and is convertible to void*. An - instance of type sentry will be constructed before - compiling each regular expression, it provides an - opportunity to carry out prefix/suffix operations on the - traits class. 

For example a traits class that - encapsulates the global locale, can use this as an - opportunity to synchronize with the global locale (by - updating any cached data).

-
 
- -
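The sentry requirement described above can be sketched as follows. This is only an illustration under stated assumptions: the class name cached_locale_traits, the private sync() helper and the refresh_count() accessor are hypothetical and are not part of the regex++ interface; only the shape of sentry (constructible from the traits class, convertible to void*) comes from the requirements table.

```cpp
#include <cassert>
#include <locale>

// Hypothetical traits fragment: only the sentry-related parts are shown.
class cached_locale_traits
{
public:
   struct sentry
   {
      // Constructed before each expression is compiled; refresh cached data here.
      sentry(const cached_locale_traits& t) { t.sync(); }
      operator void*() { return this; }
   };

   // Number of times the cache has been rebuilt (for illustration only).
   int refresh_count() const { return m_refreshes; }

private:
   void sync() const
   {
      std::locale current;   // default construction copies the global locale
      if (!m_valid || current != m_cached)
      {
         m_cached = current;
         m_valid = true;
         ++m_refreshes;      // a real class would rebuild its lookup tables here
      }
   }

   mutable std::locale m_cached;
   mutable bool m_valid = false;
   mutable int m_refreshes = 0;
};
```

Constructing a second sentry while the global locale is unchanged leaves the cache alone, which is the point of doing the check in the sentry rather than in every member function.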


- The following member constants are used to represent the -locale independent syntax of a regular expression; the member -function syntax_type returns one of these values, and is -used to convert a locale dependent regular expression, into a -locale-independent sequence of tokens.

  Member - constant  English - language representation   
  syntax_char  - All non-special - characters.   
  syntax_open_bracket  - (  -  
  syntax_close_bracket  - )  -  
  syntax_dollar  - $  -  
  syntax_caret  - ^  -  
  syntax_dot  - .  -  
  syntax_star  - *  -  
  syntax_plus  - +  -  
  syntax_question  - ?  -  
  syntax_open_set  - [  -  
  syntax_close_set  - ]  -  
  syntax_or  - |  -  
  syntax_slash  - \  -  
  syntax_hash  - #  -  
  syntax_dash  - -  -  
  syntax_open_brace  - {  -  
  syntax_close_brace  - }  -  
  syntax_digit  - 0123456789  -  
  syntax_b  - b  -  
  syntax_B  - B  -  
  syntax_left_word  - <  -  
  syntax_right_word  - >  -  
  syntax_w  - w  -  
  syntax_W  - W  -  
  syntax_start_buffer  - `  -  
  syntax_end_buffer  - '  -  
  syntax_newline  - \n   
  syntax_comma  - ,  -  
  syntax_a  - a  -  
  syntax_f  - f  -  
  syntax_n  - n  -  
  syntax_r  - r  -  
  syntax_t  - t  -  
  syntax_v  - v  -  
  syntax_x  - x  -  
  syntax_c  - c  -  
  syntax_colon  - :  -  
  syntax_equal  - =  -  
  syntax_e  - e  -  
  syntax_l  - l  -  
  syntax_L  - L  -  
  syntax_u  - u  -  
  syntax_U  - U  -  
  syntax_s  - s  -  
  syntax_S  - S  -  
  syntax_d  - d  -  
  syntax_D  - D  -  
  syntax_E  - E  -  
  syntax_Q  - Q  -  
  syntax_X  - X  -  
  syntax_C  - C  -  
  syntax_Z  - Z  -  
  syntax_G  - G  -  
  syntax_bang  - !  -  
  syntax_and  - &  -  
- -
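As a rough illustration of how syntax_type turns an expression into locale-independent tokens, the sketch below maps each character of a pattern to one of the syntax constants. It is not the library's parser: the free functions syntax_type and tokenize are hypothetical stand-ins, only a handful of the constants are reproduced, and an English-language (default) mapping is assumed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Subset of the syntax constants from the listing above, with the same values.
enum char_syntax_type
{
   syntax_char = 0,
   syntax_open_bracket = 1,
   syntax_close_bracket = 2,
   syntax_star = 6,
   syntax_plus = 7,
   syntax_slash = 12
};

// Hypothetical stand-in for the traits member function of the same name.
unsigned int syntax_type(unsigned char c)
{
   switch (c)
   {
   case '(':  return syntax_open_bracket;
   case ')':  return syntax_close_bracket;
   case '*':  return syntax_star;
   case '+':  return syntax_plus;
   case '\\': return syntax_slash;
   default:   return syntax_char;   // everything else is an ordinary character
   }
}

// Converts an expression into a sequence of locale-independent tokens.
std::vector<unsigned int> tokenize(const std::string& expression)
{
   std::vector<unsigned int> tokens;
   for (std::string::size_type i = 0; i < expression.size(); ++i)
      tokens.push_back(syntax_type(static_cast<unsigned char>(expression[i])));
   return tokens;
}
```

A traits class for another locale would change only the body of syntax_type; the token values consumed by the parser stay the same.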

The following member constants are used to represent -particular character classifications:

  Member - constant  Description -  
  char_class_none  - No - classification, must be zero.  
  char_class_alpha  - All - alphabetic characters.  
  char_class_cntrl  - All - control characters.  
  char_class_digit  - All - decimal digits.  
  char_class_lower  - All lower - case characters.  
  char_class_punct  - All - punctuation characters.  
  char_class_space  - All white-space - characters.  
  char_class_upper  - All upper - case characters.  
  char_class_xdigit  - All - hexadecimal digit characters.  
  char_class_blank  - All blank - characters (space + tab).  
  char_class_unicode  - All extended Unicode characters: those that cannot be represented as a single narrow character.  
  char_class_alnum  - All alpha-numeric - characters.  
  char_class_graph  - All - graphic characters.  
  char_class_print  - All - printable characters.  
  char_class_word  - All word - characters (alphanumeric characters + the underscore).  
- -
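Because the classification constants are combined into a bitmask, a sketch of how lookup_classname and is_class might cooperate is given below. The bit values, the class_mask typedef and the free-function form are assumptions made for this example (the listing above does not specify the values), and classification is delegated to the C library's <cctype> functions rather than to a real locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

typedef unsigned long class_mask;   // stand-in for the traits' uint32_t

// Assumed bit assignments, chosen for this sketch only.
const class_mask char_class_none  = 0;
const class_mask char_class_alpha = 1u << 0;
const class_mask char_class_digit = 1u << 1;
const class_mask char_class_space = 1u << 2;
const class_mask char_class_alnum = char_class_alpha | char_class_digit;

// True if c belongs to at least one of the classes whose bits are set in f.
bool is_class(char c, class_mask f)
{
   unsigned char uc = static_cast<unsigned char>(c);
   if ((f & char_class_alpha) && std::isalpha(uc)) return true;
   if ((f & char_class_digit) && std::isdigit(uc)) return true;
   if ((f & char_class_space) && std::isspace(uc)) return true;
   return false;
}

// Maps the name in [first, last) to its mask, or char_class_none if unknown.
class_mask lookup_classname(const char* first, const char* last)
{
   static const struct { const char* name; class_mask mask; } names[] = {
      { "alpha", char_class_alpha },
      { "digit", char_class_digit },
      { "space", char_class_space },
      { "alnum", char_class_alnum },
   };
   const std::size_t len = static_cast<std::size_t>(last - first);
   for (std::size_t i = 0; i < sizeof(names) / sizeof(names[0]); ++i)
   {
      if (std::strlen(names[i].name) == len && std::memcmp(names[i].name, first, len) == 0)
         return names[i].mask;
   }
   return char_class_none;
}
```

Returning char_class_none for an unrecognized name is what lets the expression compiler reject a set such as [[:alphabet:]].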

The following member functions are required by all regular expression traits classes. Those members that are declared here as const could be declared static instead if the class does not contain instance data:

  Member - function Description -  
  static - size_t length(const char_type* p); Returns - the length of the null-terminated string p.  
  unsigned - int syntax_type(size_type c)const;  Converts - an input character into a locale independent token (one - of the syntax_xxx member constants). Called when parsing - the regular expression into a locale-independent parse - tree. 

Example: in English language regular - expressions we would use "[[:word:]]" to - represent the character class of all word characters, and - "\w" as a shortcut for this. Consequently - syntax_type('w') returns syntax_w. In French language - regular expressions, we would use "[[:mot:]]" - in place of "[[:word:]]" and therefore "\m" - in place of "\w", therefore it is syntax_type('m') - that returns syntax_w.

-
 
  char_type - translate(char_type c, bool icase)const;  Translates - an input character into a unique identifier that - represents the equivalence class that that character - belongs to. If icase is true, then the returned value is - insensitive to case. 

[An equivalence class is - the set of all characters that must be treated as being - equivalent to each other.]

-
 
  void - transform(string_type& out, const string_type& in)const; -  Transforms - the string in, into a locale-dependent sort key, - and stores the result in out.  
  void - transform_primary(string_type& out, const - string_type& in)const;  Transforms - the string in, into a locale-dependent primary - sort key, and stores the result in out.  
  bool - is_separator(char_type c)const;  Returns - true only if c is a line separator.  
  bool - is_combining(char_type c)const;  Returns - true only if c is a unicode combining character.  
  bool is_class(char_type c, uint32_t f)const;  Returns true only if c is a member of one of the character classes represented by the bitmask f.  
  int toi(char_type - c)const;  Converts - the character c to a decimal integer. 

[Precondition: - is_class(c,char_class_digit)==true]

-
 
  int toi(const char_type*& first, const char_type* last, int radix)const;  Converts the string [first, last) into an integral value using base radix. Stops when it finds the first non-digit character, and sets first to point to that character. 

[Precondition: is_class(*first,char_class_digit)==true] -

-
 
  uint32_t lookup_classname(const char_type* first, const char_type* last)const;  Returns the bitmask representing the character class named by [first, last), or char_class_none if [first, last) is not recognized as a character class name.  
  bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const; If the sequence [first, last) is the name of a known collating element, then stores the collating element in buf, and returns true, otherwise returns false.  
  locale_type - imbue(locale_type l);  Imbues - the class with the locale l.  
  locale_type - getloc()const;  Returns - the traits-class locale.  
  std::string error_string(unsigned id)const;  Returns the locale-dependent error-string associated with the error-number id. The parameter id is one of the REG_XXX error codes described by the POSIX standard, and defined in <boost/cregex.hpp>.  
  mytraits(); -  Constructor. -  
  ~mytraits(); -  Destructor. -  
- -
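The two-iterator form of toi is the least obvious of the functions above, so a minimal sketch may help. It is written as a free function over plain char, assumes an ASCII-compatible character set, supports bases up to 16 and does no overflow checking; digit_value is a hypothetical helper and is not part of the traits interface.

```cpp
#include <cassert>

// Value of a single digit character, or -1 if c is not a digit in any base up to 16.
int digit_value(char c)
{
   if (c >= '0' && c <= '9') return c - '0';
   if (c >= 'a' && c <= 'f') return c - 'a' + 10;
   if (c >= 'A' && c <= 'F') return c - 'A' + 10;
   return -1;
}

// Converts [first, last) using base radix and leaves first on the first non-digit.
int toi(const char*& first, const char* last, int radix)
{
   int result = 0;
   while (first != last)
   {
      int d = digit_value(*first);
      if (d < 0 || d >= radix)
         break;                      // stop at the first character that is not a digit
      result = result * radix + d;
      ++first;
   }
   return result;
}
```

Because first is passed by reference, the caller can resume parsing at the character that ended the number, for example the comma in a repeat such as {2,4}.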

There is also an example of a custom traits class supplied by Christian Engström, see iso8859_1_regex_traits.cpp and iso8859_1_regex_traits.hpp. This example inherits from c_regex_traits and provides its own implementations of two locale-specific functions. This ensures that the class gives consistent behaviour (albeit tied to one locale) on all platforms. A fuller description by the author is available in the readme file.
-

- -
- -

Copyright Dr -John Maddock 1998-2001 all rights reserved.

- -