diff --git a/appendix.htm b/appendix.htm deleted file mode 100644 index ba0b3bdf..00000000 --- a/appendix.htm +++ /dev/null @@ -1,1304 +0,0 @@ - - -
- - - -- -
Regex++, Appendices.-Copyright (c) 1998-2001 -Dr John Maddock -Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty. - |
-
This is the first port of regex++ to the boost library, and is -based on regex++ 2.x, see changes.txt for a full list of changes -from the previous version. There are no known functionality bugs -except that POSIX style equivalence classes are only guaranteed -correct if the Win32 localization model is used (the default for -Win32 builds of the library).
- -There are some aspects of the code that C++ puritans will -consider to be poor style, in particular the use of goto in some -of the algorithms. The code could be cleaned up, by changing to a -recursive implementation, although it is likely to be slower in -that case.
- -The performance of the algorithms should be satisfactory in -most cases. For example the times taken to match the ftp response -expression "^([0-9]+)(\-| |$)(.*)$" against the string -"100- this is a line of ftp response which contains a -message string" are: BSD implementation 450 micro seconds, -GNU implementation 271 micro seconds, regex++ 127 micro seconds (Pentium -P90, Win32 console app under MS Windows 95).
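The ftp response expression quoted above can be tried out with a short sketch. This is not the regex++ API: it assumes the standard library's std::regex as a stand-in, and the function name parse_ftp_response is hypothetical.

```cpp
#include <regex>
#include <string>

// Splits one line of an ftp response into its numeric code and message text.
// Returns false when the line does not have the expected shape.
// Assumption: std::regex stands in for regex++ here.
bool parse_ftp_response(const std::string& line,
                        std::string& code,
                        std::string& message)
{
    // Same expression as in the text; backslashes are doubled for the C++ compiler.
    static const std::regex expr("^([0-9]+)(\\-| |$)(.*)$");
    std::smatch what;
    if (!std::regex_match(line, what, expr))
        return false;
    code = what[1].str();     // the leading digits
    message = what[3].str();  // everything after the separator
    return true;
}
```

Sub-expression 2 holds the separator (a dash or a space), which is why the message is taken from sub-expression 3.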
- -However it should be noted that there are some "pathological" -expressions which may require exponential time for matching; -these all involve nested repetition operators, for example -attempting to match the expression "(a*a)*b" against N -letter a's requires time proportional to 2^N. -These expressions can (almost) always be rewritten in such a way -as to avoid the problem, for example "(a*a)*b" could be -rewritten as "a*b" which requires only time linearly -proportional to N to solve. In the general case, non-nested -repeat expressions require time proportional to N^2, -however if the clauses are mutually exclusive then they can be -matched in linear time - this is the case with "a*b", -for each character the matcher will either match an "a" -or a "b" or fail, whereas with "a*a" the -matcher can't tell which branch to take (the first "a" -or the second) and so has to try both. Be careful how you -write your regular expressions and avoid nested repeats if you -can! New to this version, some previously pathological cases have -been fixed - in particular searching for expressions which -contain leading repeats and/or leading literal strings should be -much faster than before. Literal strings are now searched for -using the Knuth/Morris/Pratt algorithm (this is used in -preference to the Boyer/Moore algorithm because it allows the -tracking of newline characters).
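As a rough illustration of the rewrite described above, the two expressions can be compared on short inputs. This sketch assumes std::regex as a stand-in for regex++; the function names are hypothetical, and the inputs are kept short so that the nested form stays cheap to evaluate.

```cpp
#include <regex>
#include <string>

// Nested repeat from the text: may need exponential time on long runs of 'a'.
bool matches_nested(const std::string& s)
{
    static const std::regex nested("(a*a)*b");
    return std::regex_match(s, nested);
}

// Rewritten form from the text: accepts the same strings without nesting.
bool matches_flat(const std::string& s)
{
    static const std::regex flat("a*b");
    return std::regex_match(s, flat);
}
```

Both functions accept "aaab" and "b" and reject "aaaa"; only the amount of backtracking needed to reach that answer differs.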
- -Some aspects of the POSIX regular expression syntax are -implementation defined:
- -Class reg_expression<> and its typedefs regex and wregex -are thread safe, in that compiled regular expressions can safely -be shared between threads. The matching algorithms regex_match, -regex_search, regex_grep, regex_format and regex_merge are all re-entrant -and thread safe. Class match_results is now thread safe, in that -the results of a match can be safely copied from one thread to -another (for example one thread may find matches and push -match_results instances onto a queue, while another thread pops -them off the other end), otherwise use a separate instance of -match_results per thread.
- -The POSIX API functions are all re-entrant and thread safe, -regular expressions compiled with regcomp can also be -shared between threads.
- -The class RegEx is only thread safe if each thread gets its -own RegEx instance (apartment threading) - this is a consequence -of RegEx handling both compiling and matching regular expressions. -
- -Finally note that changing the global locale invalidates all -compiled regular expressions, therefore calling set_locale -from one thread while another uses regular expressions will -produce unpredictable results.
- -There is also a requirement that there is only one thread -executing prior to the start of main().
- -Regex++ provides extensive support for run-time -localization, the localization model used can be split into two -parts: front-end and back-end.
- -Front-end localization deals with everything which the user -sees - error messages, and the regular expression syntax itself. -For example a French application could change [[:word:]] to [[:mot:]] -and \w to \m. Modifying the front end locale requires active -support from the developer, by providing the library with a -message catalogue to load, containing the localized strings. -Front-end locale is affected by the LC_MESSAGES category only.
- -Back-end localization deals with everything that occurs after -the expression has been parsed - in other words everything that -the user does not see or interact with directly. It deals with -case conversion, collation, and character class membership. The -back-end locale does not require any intervention from the -developer - the library will acquire all the information it -requires for the current locale from the underlying operating -system / run time library. This means that if the program user -does not interact with regular expressions directly - for example -if the expressions are embedded in your C++ code - then no -explicit localization is required, as the library will take care -of everything for you. For example embedding the expression [[:word:]]+ -in your code will always match a whole word; if the program is -run on a machine with, for example, a Greek locale, then it will -still match a whole word, but in Greek characters rather than -Latin ones. The back-end locale is affected by the LC_CTYPE and -LC_COLLATE categories.
- -There are three separate localization mechanisms supported by -regex++:
- -Win32 localization model.
- -This is the default model when the library is compiled under -Win32, and is encapsulated by the traits class w32_regex_traits. -When this model is in effect there is a single global locale as -defined by the user's control panel settings, and returned by -GetUserDefaultLCID. All the settings used by regex++ are acquired -directly from the operating system bypassing the C run time -library. Front-end localization requires a resource dll, -containing a string table with the user-defined strings. The -traits class exports the function:
- -static std::string set_message_catalogue(const std::string& -s);
- -which needs to be called with a string identifying the name of -the resource dll, before your code compiles any regular -expressions (but not necessarily before you construct any reg_expression -instances):
- -boost::w32_regex_traits<char>::set_message_catalogue("mydll.dll"); -
- -Note that this API sets the dll name for both the -narrow and wide character specializations of w32_regex_traits.
- -This model does not currently support thread specific locales -(via SetThreadLocale under Windows NT), the library provides full -Unicode support under NT, under Windows 9x the library degrades -gracefully - characters 0 to 255 are supported, the remainder are -treated as "unknown" graphic characters.
- -C localization model.
- -This is the default model when the library is compiled under -an operating system other than Win32, and is encapsulated by the -traits class c_regex_traits, -Win32 users can force this model to take effect by defining the -pre-processor symbol BOOST_REGEX_USE_C_LOCALE. When this model is -in effect there is a single global locale, as set by setlocale. -All settings are acquired from your run time library, -consequently Unicode support is dependent upon your run time -library implementation. Front end localization requires a POSIX -message catalogue. The traits class exports the function:
- -static std::string set_message_catalogue(const std::string& -s);
- -which needs to be called with a string identifying the name of -the message catalogue, before your code compiles any -regular expressions (but not necessarily before you construct any -reg_expression instances):
- -boost::c_regex_traits<char>::set_message_catalogue("mycatalogue"); -
- -Note that this API sets the catalogue name for both the -narrow and wide character specializations of c_regex_traits. If -your run time library does not support POSIX message catalogues, -then you can either provide your own implementation of -<nl_types.h> or define BOOST_RE_NO_CAT to disable front-end -localization via message catalogues.
- -Note that calling setlocale invalidates all compiled -regular expressions, calling setlocale(LC_ALL, "C") -will make this library behave equivalent to most traditional -regular expression libraries including version 1 of this library. -
- -C++ localization model. -
- -This model is only in effect if the library is built with the -pre-processor symbol BOOST_REGEX_USE_CPP_LOCALE defined. When -this model is in effect each instance of reg_expression<> -has its own instance of std::locale, class reg_expression<> -also has a member function imbue which allows the locale -for the expression to be set on a per-instance basis. Front end -localization requires a POSIX message catalogue, which will be -loaded via the std::messages facet of the expression's locale, -the traits class exports the symbol:
- -static std::string set_message_catalogue(const std::string& -s);
- -which needs to be called with a string identifying the name of -the message catalogue, before your code compiles any -regular expressions (but not necessarily before you construct any -reg_expression instances):
- -boost::cpp_regex_traits<char>::set_message_catalogue("mycatalogue"); -
- -Note that calling reg_expression<>::imbue will -invalidate any expression currently compiled in that instance of -reg_expression<>. This model is the one which closest fits -the ethos of the C++ standard library, however it is the model -which will produce the slowest code, and which is the least well -supported by current standard library implementations, for -example I have yet to find an implementation of std::locale which -supports either message catalogues, or locales other than "C" -or "POSIX".
- -Finally note that if you build the library with a non-default -localization model, then the appropriate pre-processor symbol (BOOST_REGEX_USE_C_LOCALE -or BOOST_REGEX_USE_CPP_LOCALE) must be defined both when you -build the support library, and when you include <boost/regex.hpp> -or <boost/cregex.hpp> in your code. The best way to ensure -this is to add the #define to <boost/regex/detail/regex_options.hpp>. -
- -Providing a message catalogue:
- -In order to localize the front end of the library, you need to
-provide the library with the appropriate message strings
-contained either in a resource dll's string table (Win32 model),
-or a POSIX message catalogue (C or C++ models). In the latter
-case the messages must appear in message set zero of the
-catalogue. The messages and their id's are as follows:
-
- | Message id | -Meaning | -Default value | -- |
- | 101 | -The character used to start - a sub-expression. | -"(" | -- |
- | 102 | -The character used to end a - sub-expression declaration. | -")" | -- |
- | 103 | -The character used to denote - an end of line assertion. | -"$" | -- |
- | 104 | -The character used to denote - the start of line assertion. | -"^" | -- |
- | 105 | -The character used to denote - the "match any character expression". | -"." | -- |
- | 106 | -The match zero or more times - repetition operator. | -"*" | -- |
- | 107 | -The match one or more - repetition operator. | -"+" | -- |
- | 108 | -The match zero or one - repetition operator. | -"?" | -- |
- | 109 | -The character set opening - character. | -"[" | -- |
- | 110 | -The character set closing - character. | -"]" | -- |
- | 111 | -The alternation operator. | -"|" | -- |
- | 112 | -The escape character. | -"\\" | -- |
- | 113 | -The hash character (not - currently used). | -"#" | -- |
- | 114 | -The range operator. | -"-" | -- |
- | 115 | -The repetition operator - opening character. | -"{" | -- |
- | 116 | -The repetition operator - closing character. | -"}" | -- |
- | 117 | -The digit characters. | -"0123456789" | -- |
- | 118 | -The character which when - preceded by an escape character represents the word - boundary assertion. | -"b" | -- |
- | 119 | -The character which when - preceded by an escape character represents the non-word - boundary assertion. | -"B" | -- |
- | 120 | -The character which when - preceded by an escape character represents the word-start - boundary assertion. | -"<" | -- |
- | 121 | -The character which when - preceded by an escape character represents the word-end - boundary assertion. | -">" | -- |
- | 122 | -The character which when - preceded by an escape character represents any word - character. | -"w" | -- |
- | 123 | -The character which when - preceded by an escape character represents a non-word - character. | -"W" | -- |
- | 124 | -The character which when - preceded by an escape character represents a start of - buffer assertion. | -"`A" | -- |
- | 125 | -The character which when - preceded by an escape character represents an end of - buffer assertion. | -"'z" | -- |
- | 126 | -The newline character. | -"\n" | -- |
- | 127 | -The comma separator. | -"," | -- |
- | 128 | -The character which when - preceded by an escape character represents the bell - character. | -"a" | -- |
- | 129 | -The character which when - preceded by an escape character represents the form feed - character. | -"f" | -- |
- | 130 | -The character which when - preceded by an escape character represents the newline - character. | -"n" | -- |
- | 131 | -The character which when - preceded by an escape character represents the carriage - return character. | -"r" | -- |
- | 132 | -The character which when - preceded by an escape character represents the tab - character. | -"t" | -- |
- | 133 | -The character which when - preceded by an escape character represents the vertical - tab character. | -"v" | -- |
- | 134 | -The character which when - preceded by an escape character represents the start of a - hexadecimal character constant. | -"x" | -- |
- | 135 | -The character which when - preceded by an escape character represents the start of - an ASCII escape character. | -"c" | -- |
- | 136 | -The colon character. | -":" | -- |
- | 137 | -The equals character. | -"=" | -- |
- | 138 | -The character which when - preceded by an escape character represents the ASCII - escape character. | -"e" | -- |
- | 139 | -The character which when - preceded by an escape character represents any lower case - character. | -"l" | -- |
- | 140 | -The character which when - preceded by an escape character represents any non-lower - case character. | -"L" | -- |
- | 141 | -The character which when - preceded by an escape character represents any upper case - character. | -"u" | -- |
- | 142 | -The character which when - preceded by an escape character represents any non-upper - case character. | -"U" | -- |
- | 143 | -The character which when - preceded by an escape character represents any space - character. | -"s" | -- |
- | 144 | -The character which when - preceded by an escape character represents any non-space - character. | -"S" | -- |
- | 145 | -The character which when - preceded by an escape character represents any digit - character. | -"d" | -- |
- | 146 | -The character which when - preceded by an escape character represents any non-digit - character. | -"D" | -- |
- | 147 | -The character which when - preceded by an escape character represents the end quote - operator. | -"E" | -- |
- | 148 | -The character which when - preceded by an escape character represents the start - quote operator. | -"Q" | -- |
- | 149 | -The character which when - preceded by an escape character represents a Unicode - combining character sequence. | -"X" | -- |
- | 150 | -The character which when - preceded by an escape character represents any single - character. | -"C" | -- |
- | 151 | -The character which when - preceded by an escape character represents end of buffer - operator. | -"Z" | -- |
- | 152 | -The character which when - preceded by an escape character represents the - continuation assertion. | -"G" | -- |
- | 153 | -The character which when preceded by (? indicates a - zero width negated forward lookahead assertion. | -"!" | -- |
-
Custom error messages are loaded as follows:
-
- | Message ID | -Error message ID | -Default string | -- |
- | 201 | -REG_NOMATCH | -"No match" | -- |
- | 202 | -REG_BADPAT | -"Invalid regular - expression" | -- |
- | 203 | -REG_ECOLLATE | -"Invalid collation - character" | -- |
- | 204 | -REG_ECTYPE | -"Invalid character - class name" | -- |
- | 205 | -REG_EESCAPE | -"Trailing backslash" - | -- |
- | 206 | -REG_ESUBREG | -"Invalid back reference" - | -- |
- | 207 | -REG_EBRACK | -"Unmatched [ or [^" - | -- |
- | 208 | -REG_EPAREN | -"Unmatched ( or \\(" - | -- |
- | 209 | -REG_EBRACE | -"Unmatched \\{" | -- |
- | 210 | -REG_BADBR | -"Invalid content of - \\{\\}" | -- |
- | 211 | -REG_ERANGE | -"Invalid range end" - | -- |
- | 212 | -REG_ESPACE | -"Memory exhausted" - | -- |
- | 213 | -REG_BADRPT | -"Invalid preceding - regular expression" | -- |
- | 214 | -REG_EEND | -"Premature end of - regular expression" | -- |
- | 215 | -REG_ESIZE | -"Regular expression too - big" | -- |
- | 216 | -REG_ERPAREN | -"Unmatched ) or \\)" - | -- |
- | 217 | -REG_EMPTY | -"Empty expression" - | -- |
- | 218 | -REG_E_UNKNOWN | -"Unknown error" | -- |
-
Custom character class names are loaded as follows:
-
- | Message ID | -Description | -Equivalent default class - name | -- |
- | 300 | -The character class name for - alphanumeric characters. | -"alnum" | -- |
- | 301 | -The character class name for - alphabetic characters. | -"alpha" | -- |
- | 302 | -The character class name for - control characters. | -"cntrl" | -- |
- | 303 | -The character class name for - digit characters. | -"digit" | -- |
- | 304 | -The character class name for - graphics characters. | -"graph" | -- |
- | 305 | -The character class name for - lower case characters. | -"lower" | -- |
- | 306 | -The character class name for - printable characters. | -"print" | -- |
- | 307 | -The character class name for - punctuation characters. | -"punct" | -- |
- | 308 | -The character class name for - space characters. | -"space" | -- |
- | 309 | -The character class name for - upper case characters. | -"upper" | -- |
- | 310 | -The character class name for - hexadecimal characters. | -"xdigit" | -- |
- | 311 | -The character class name for - blank characters. | -"blank" | -- |
- | 312 | -The character class name for - word characters. | -"word" | -- |
- | 313 | -The character class name for - Unicode characters. | -"unicode" | -- |
-
Finally, custom collating element names are loaded starting -from message id 400, and terminating when the first load -thereafter fails. Each message looks something like: "tagname -string" where tagname is the name used inside [[.tagname.]] -and string is the actual text of the collating element. -Note that the value of collating element [[.zero.]] is used for -the conversion of strings to numbers - if you replace this with -another value then that will be used for string parsing - for -example use the Unicode character 0x0660 for [[.zero.]] if you -want to use Unicode Arabic-Indic digits in your regular -expressions in place of Latin digits.
- -Note that the POSIX defined names for character classes and -collating elements are always available - even if custom names -are defined, in contrast, custom error messages, and custom -syntax messages replace the default ones.
- -There are three demo applications that ship with this library, -they all come with makefiles for Borland, Microsoft and gcc -compilers, otherwise you will have to create your own makefiles.
- -A regression test application that gives the matching/searching -algorithms a full workout. The presence of this program is your -guarantee that the library will behave as claimed - at least as -far as those items tested are concerned - if anyone spots -anything that isn't being tested I'd be glad to hear about it.
- -Files: parse.cpp, regress.cpp, tests.cpp.
- -A simple grep implementation, run with no command line options -to find out its usage. Look at fileiter.cpp/fileiter.hpp -and the mapfile class to see an example of a "smart" -bidirectional iterator that can be used with regex++ or any other -STL algorithm.
- - - -A simple interactive expression matching application, the -results of all matches are timed, allowing the programmer to -optimize their regular expressions where performance is critical. -
- -Files: regex_timer.cpp. -
- -The snippets examples contain the code examples used in the -documentation:
- -regex_match_example.cpp: -ftp based regex_match example.
- -regex_search_example.cpp: -regex_search example: searches a cpp file for class definitions.
- -regex_grep_example_1.cpp: -regex_grep example 1: searches a cpp file for class definitions.
- -regex_merge_example.cpp: -regex_merge example: converts a C++ file to syntax highlighted -HTML.
- -regex_grep_example_2.cpp: -regex_grep example 2: searches a cpp file for class definitions, -using a global callback function.
- -regex_grep_example_3.cpp: -regex_grep example 3: searches a cpp file for class definitions, -using a bound member function callback.
- -regex_grep_example_4.cpp: -regex_grep example 4: searches a cpp file for class definitions, -using a C++ Builder closure as a callback.
- -regex_split_example_1.cpp: -regex_split example: split a string into tokens.
- -regex_split_example_2.cpp: -regex_split example: spit out linked URL's.
- -There are two main headers used by this library: <boost/regex.hpp> -provides full access to the entire library, while <boost/cregex.hpp> -provides access to just the high level class RegEx, and the POSIX -API functions.
- - If you are using Microsoft or Borland C++ and link to a
-dll version of the run time library, then you will also link to
-one of the dll versions of regex++. While these dll's are
-redistributable, there are no "standard" versions, so
-when installing on the users PC, you should place these in a
-directory private to your application, and not in the PC's
-directory path. Note that if you link to a static version of your
-run time library, then you will also link to a static version of
-regex++ and no dll's will need to be distributed. The possible
-regex++ dll and library names are computed according to the
-following formula:
-
"boost_regex_"
-+ BOOST_LIB_TOOLSET
-+ "_"
-+ BOOST_LIB_THREAD_OPT
-+ BOOST_LIB_RT_OPT
-+ BOOST_LIB_LINK_OPT
-+ BOOST_LIB_DEBUG_OPT
-
-These are defined as:
-
-BOOST_LIB_TOOLSET: The compiler toolset name (vc6, vc7, bcb5 etc).
-
-BOOST_LIB_THREAD_OPT: "s" for single thread builds,
-"m" for multithread builds.
-
-BOOST_LIB_RT_OPT: "s" for static runtime,
-"d" for dynamic runtime.
-
-BOOST_LIB_LINK_OPT: "s" for static link,
-"i" for dynamic link.
-
-BOOST_LIB_DEBUG_OPT: nothing for release builds,
-"d" for debug builds,
-"dd" for debug-diagnostic builds (_STLP_DEBUG).
Note: you can disable automatic library selection by defining -the symbol BOOST_REGEX_NO_LIB when compiling, this is useful if -you want to statically link even though you're using the dll -version of your run time library, or if you need to debug regex++. -
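The naming formula above can be written out as a small helper. This is only a sketch of the string composition: the function name regex_lib_name and its parameters are hypothetical and are not part of regex++.

```cpp
#include <string>

// Builds a regex++ library name from the formula given in the text.
// debug_level: 0 = release, 1 = debug ("d"), 2 = debug-diagnostic ("dd").
std::string regex_lib_name(const std::string& toolset,
                           bool multithread,
                           bool dynamic_runtime,
                           bool dynamic_link,
                           int debug_level)
{
    std::string name = "boost_regex_" + toolset + "_";
    name += multithread ? "m" : "s";      // BOOST_LIB_THREAD_OPT
    name += dynamic_runtime ? "d" : "s";  // BOOST_LIB_RT_OPT
    name += dynamic_link ? "i" : "s";     // BOOST_LIB_LINK_OPT
    if (debug_level == 1)                 // BOOST_LIB_DEBUG_OPT
        name += "d";
    else if (debug_level == 2)
        name += "dd";
    return name;
}
```

For example, a multithreaded release build with the vc6 toolset, linked dynamically against a dynamic runtime, gives "boost_regex_vc6_mdi".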
- -This version of regex++ is the first to be ported to the boost project, and as a result -has a number of changes to comply with the boost coding -guidelines.
- -Headers have been changed from <header> or <header.h> -to <boost/header.hpp>
- -The library namespace has changed from "jm", to -"boost".
- -The reg_xxx algorithms have been renamed regex_xxx (to improve -naming consistency).
- -Algorithm query_match has been renamed regex_match, and only -returns true if the expression matches the whole of the input -string (think input data validation).
- -Compiling existing code:
- -The directory, libs/regex/old_include contains a set of -headers that make this version of regex++ compatible with -previous ones, either add this directory to your include path, or -copy these headers to the root directory of your boost -installation. The contents of these headers are deprecated and -undocumented - really these are just here for existing code - for -new projects use the new header forms.
- -The author can be contacted at John_Maddock@compuserve.com, -the home page for this library is at http://ourworld.compuserve.com/homepages/John_Maddock/regexpp.htm, -and the official boost version can be obtained from www.boost.org/libraries.htm.
- -I am indebted to Robert Sedgewick's "Algorithms in C++" -for forcing me to think about algorithms and their performance, -and to the folks at boost for forcing me to think, period. -The following people have all contributed useful comments or -fixes: Dave Abrahams, Mike Allison, Edan Ayal, Jayashree -Balasubramanian, Jan Bölsche, Beman Dawes, Paul Baxter, David -Bergman, David Dennerline, Edward Diener, Peter Dimov, Robert -Dunn, Fabio Forno, Tobias Gabrielsson, Rob Gillen, Marc Gregoire, -Chris Hecker, Nick Hodapp, Jesse Jones, Martin Jost, Boris -Krasnovskiy, Jan Hermelink, Max Leung, Wei-hao Lin, Jens Maurer, -Richard Peters, Heiko Schmidt, Jason Shirk, Gerald Slacik, Scobie -Smith, Mike Smyth, Alexander Sokolovsky, Hervé Poirier, Michael -Raykh, Marc Recht, Scott VanCamp, Bruno Voigt, Alexey Voinov, -Jerry Waldorf, Rob Ward, Lealon Watts, Thomas Witt and Yuval -Yosef. I am also grateful to the manuals supplied with the Henry -Spencer, Perl and GNU regular expression libraries - wherever -possible I have tried to maintain compatibility with these -libraries and with the POSIX standard - the code however is -entirely my own, including any bugs! I can absolutely guarantee -that I will not fix any bugs I don't know about, so if you have -any comments or spot any bugs, please get in touch.
- -Useful further information can be found at:
- -A short tutorial on regular expressions can -be found here.
- -The Open -Unix Specification contains a wealth of useful material, -including the regular expression syntax, and specifications for <regex.h> -and <nl_types.h>. -
- -The Pattern -Matching Pointers site is a "must visit" resource -for anyone interested in pattern matching.
- -Glimpse and Agrep, -use a simplified regular expression syntax to achieve faster -search times.
- -Udi Manber -and Ricardo Baeza-Yates -both have a selection of useful pattern matching papers available -from their respective web sites.
- -Copyright Dr -John Maddock 1998-2000 all rights reserved.
- - diff --git a/faq.htm b/faq.htm deleted file mode 100644 index fb3795b6..00000000 --- a/faq.htm +++ /dev/null @@ -1,205 +0,0 @@ - - - - - - -- -
Regex++, FAQ.-Copyright (c) 1998-2001 -Dr John Maddock -Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty. - |
-
Q. Why does using parentheses in a -regular expression change the result of a match?
- -Parentheses don't only mark; they determine what the best -match is as well. regex++ tries to follow the POSIX standard -leftmost longest rule for determining what matched. So if there -is more than one possible match after considering the whole -expression, it looks next at the first sub-expression and then -the second sub-expression and so on. So...
- -"(0*)([0-9]*)" against "00123" would produce -$1 = "00" -$2 = "123"- -
whereas
- -"0*([0-9]*)" against "00123" would produce -$1 = "00123"- -
If you think about it, had $1 only matched the "123", -this would be "less good" than the match "00123" -which is both further to the left and longer. If you want $1 to -match only the "123" part, then you need to use -something like:
- -"0*([1-9][0-9]*)"- -
as the expression.
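The first and last expressions above can be checked with a small helper. This sketch assumes std::regex in its default (ECMAScript) mode as a stand-in. That mode does not implement the POSIX leftmost longest rule, so only the two cases where greedy matching gives the same sub-expression results are shown. The function name capture is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the text captured by sub-expression `index` when `pattern`
// matches the whole of `text`, or an empty string otherwise.
std::string capture(const std::string& pattern,
                    const std::string& text,
                    std::size_t index)
{
    const std::regex expr(pattern);
    std::smatch what;
    if (!std::regex_match(text, what, expr) || index >= what.size())
        return std::string();
    return what[index].str();
}
```

With this helper, capture("(0*)([0-9]*)", "00123", 1) yields "00" and capture("0*([1-9][0-9]*)", "00123", 1) yields "123".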
- -Q. Configure says that my compiler is -unable to merge template instances, what does this mean?
- -A. When you compile template code, you can end up with the -same template instances in multiple translation units - this will -lead to link time errors unless your compiler/linker is smart -enough to merge these template instances into a single record in -the executable file. If you see this warning after running -configure, then you can still link to libregex++.a if:
- -Another option is to create a master include file, which -#include's all the regex++ source files, and all the source files -in which you use regex++. You then compile and link this master -file as a single translation unit.
- -Q. Configure says that my compiler is -unable to merge template instances from archive files, what does -this mean?
- -A. When you compile template code, you can end up with the -same template instances in multiple translation units - this will -lead to link time errors unless your compiler/linker is smart -enough to merge these template instances into a single record in -the executable file. Some compilers are able to do this for -normal .cpp or .o files, but fail if the object file has been -placed in a library archive. If you see this warning after -running configure, then you can still link to libregex++.a if:
- -Another option is to add the regex++ source files directly to -your project instead of linking to libregex++.a, generally you -should do this only if you are getting link time errors with -libregex++.a.
- -Q. Configure says that my compiler can't -merge templates containing switch statements, what does this -mean?
- -A. Some compilers can't merge templates that contain static -data - this includes switch statements which implicitly generate -static data as well as code. Principally this affects the egcs -compiler - but note gcc 2.81 also suffers from this problem - the -compiler will compile and link the code - but the code will not -run because the code and the static data it uses have become -separated. The default behaviour of regex++ is to try and fix -this problem by declaring "problem" templates inside -unnamed namespaces, so that the templates have internal linkage. -Note that this can result in a great deal of code bloat. If the -compiler doesn't support namespaces, or if code bloat becomes a -problem, then follow the guidelines above for placing all the -templates used in a single translation unit, and edit boost/regex/config.hpp -so that BOOST_REGEX_NO_TEMPLATE_SWITCH_MERGE is no longer defined. -
- -Q. I can't get regex++ to work with -escape characters, what's going on?
- -A. If you embed regular expressions in C++ code, then remember -that escape characters are processed twice: once by the C++ -compiler, and once by the regex++ expression compiler, so to pass -the regular expression \d+ to regex++, you need to embed "\\d+" -in your code. Likewise to match a literal backslash you will need -to embed "\\\\" in your code.
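A minimal sketch of the double escaping described above, assuming std::regex as a stand-in for regex++ (the function names are hypothetical):

```cpp
#include <regex>
#include <string>

// True when the whole string consists of one or more digits.
bool is_all_digits(const std::string& s)
{
    // The compiler turns "\\d+" into \d+, which is what the regex engine sees.
    static const std::regex digits("\\d+");
    return std::regex_match(s, digits);
}

// True when the string contains at least one literal backslash.
bool contains_backslash(const std::string& s)
{
    // The compiler turns "\\\\" into \\, which the regex engine reads
    // as one escaped (literal) backslash.
    static const std::regex backslash("\\\\");
    return std::regex_search(s, backslash);
}
```

The same rule applies to the test data: the C++ literal "C:\\temp" is the seven-character string C:\temp.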
- -Q. Why don't character ranges work
-properly?
-A. The POSIX standard specifies that character range expressions
-are locale sensitive - so for example the expression [A-Z] will
-match any collating element that collates between 'A' and 'Z'.
-That means that for most locales other than "C" or
-"POSIX", [A-Z] would match the single character 't' for
-example, which is not what most people expect - or at least not
-what most people have come to expect from regular expression
-engines. For this reason, the default behaviour of regex++ is to
-turn locale sensitive collation off by setting the regbase::nocollate
-compile time flag (this is set by regbase::normal). However if
-you set a non-default compile time flag - for example regbase::extended
-or regbase::basic, then locale dependent collation will be
-enabled, this also applies to the POSIX API functions which use
-either regbase::extended or regbase::basic internally, in the
-latter case use REG_NOCOLLATE in combination with either
-REG_BASIC or REG_EXTENDED when invoking regcomp if you don't want
-locale sensitive collation. [Note - when regbase::nocollate is in
-effect, the library behaves "as if" the LC_COLLATE
-locale category were always "C", regardless of what it's
-actually set to - end note].
Q. Why can't I use the "convenience" -versions of query_match/reg_search/reg_grep/reg_format/reg_merge? -
- -A. These versions may or may not be available depending upon -the capabilities of your compiler, the rules determining the -format of these functions are quite complex - and only the -versions visible to a standard compliant compiler are given in -the help. To find out what your compiler supports, run <boost/regex.hpp> -through your C++ pre-processor, and search the output file for -the function that you are interested in.
- -Q. Why are there no throw specifications -on any of the functions? What exceptions can the library throw? -
- -A. Not all compilers support (or honor) throw specifications, -others support them but with reduced efficiency. Throw -specifications may be added at a later date as compilers begin to -handle this better. The library should throw only three types of -exception: boost::bad_expression can be thrown by reg_expression -when compiling a regular expression, std::runtime_error can be -thrown when a call to reg_expression::imbue tries to open a -message catalogue that doesn't exist or when a call to RegEx::GrepFiles -or RegEx::FindFiles tries to open a file that cannot be opened, -finally std::bad_alloc can be thrown by just about any of the -functions in this library.
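As an illustration of the first case, a caller can guard expression compilation with a try block. The sketch below is an assumption-laden stand-in: it uses the standard library's `std::regex` and `std::regex_error` so that it is self-contained, whereas with regex++ the exception named above is boost::bad_expression. The function name `compiles` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if the expression compiles, false if
// the expression compiler rejects it. std::regex_error plays the role
// that boost::bad_expression plays in regex++.
inline bool compiles(const std::string& expression)
{
    try
    {
        std::regex e(expression);
        return true;
    }
    catch (const std::regex_error&)
    {
        return false;
    }
}
```

Allocation failures (std::bad_alloc) are deliberately not caught here, since there is rarely anything useful a caller at this level can do about them.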
- -Copyright Dr -John Maddock 1998-2000 all rights reserved.
-
-
diff --git a/format_string.htm b/format_string.htm
deleted file mode 100644
index 41a33842..00000000
--- a/format_string.htm
+++ /dev/null
@@ -1,243 +0,0 @@
-
Regex++, Format String Reference.
-Copyright (c) 1998-2001 Dr John Maddock
-
Format strings are used by the algorithms regex_format and regex_merge, and are -used to transform one string into another.
- -There are three kinds of format string: sed, perl and extended; -the extended syntax is the default, so it is covered first.
- -Extended format syntax
- -In format strings, all characters are treated as literals -except: ()$\?:
- -To use any of these as literals you must prefix them with the -escape character \
- -The following special sequences are recognized:
-
-
Grouping:
- -Use the parenthesis characters ( and ) to group sub-expressions
-within the format string, use \( and \) to represent literal '('
-and ')'.
-
-
Sub-expression expansions:
- -The following perl like expressions expand to a particular
-matched sub-expression:
-
- | $` | -Expands to all the text from - the end of the previous match to the start of the current - match, if there was no previous match in the current - operation, then everything from the start of the input - string to the start of the match. | -- |
- | $' | -Expands to all the text from - the end of the match to the end of the input string. | -- |
- | $& | -Expands to all of the - current match. | -- |
- | $0 | -Expands to all of the - current match. | -- |
- | $N | -Expands to the text that - matched sub-expression N. | -- |
-
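The perl-like expansions in the table above can be tried out with any engine that understands them. The sketch below assumes the standard library's `std::regex_replace`, whose default format syntax accepts the same $`, $', $& and $N forms; it is a stand-in for illustration, not the regex++ regex_format or regex_merge call, and the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps the match in brackets together with the
// text before it ($`) and the text after it ($').
inline std::string show_context(const std::string& text)
{
    static const std::regex e("b");
    return std::regex_replace(text, e, "[$`|$&|$']");
}

// Hypothetical helper: swaps two hyphen-separated numbers using the
// $N sub-expression expansions.
inline std::string swap_pair(const std::string& text)
{
    static const std::regex e("(\\d+)-(\\d+)");
    return std::regex_replace(text, e, "$2-$1");
}
```

For the input "abc", show_context produces "a[a|b|c]c": the unmatched "a" and "c" are copied through, and the match itself is replaced by the expanded format string.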
Conditional expressions:
- -Conditional expressions allow two different format strings to -be selected dependent upon whether a sub-expression participated -in the match or not:
- -?Ntrue_expression:false_expression
- -Executes true_expression if sub-expression N -participated in the match, otherwise executes false_expression.
- -Example: suppose we search for "(while)|(for)" then
-the format string "?1WHILE:FOR" would output what
-matched, but in upper case.
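The effect of the conditional format string "?1WHILE:FOR" can be spelled out by hand. The sketch below assumes the standard library's `std::regex`, whose format strings have no ?N form, so the branch is chosen explicitly by testing whether sub-expression 1 participated in the match. The function name is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: hand-written equivalent of applying the format
// string "?1WHILE:FOR" to the first match of "(while)|(for)".
inline std::string expand_conditional(const std::string& text)
{
    static const std::regex e("(while)|(for)");
    std::smatch m;
    if (!std::regex_search(text, m, e))
        return std::string();           // no match, so nothing to format
    // true_expression if sub-expression 1 matched, else false_expression
    return m[1].matched ? "WHILE" : "FOR";
}
```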
-
-
Escape sequences:
- -The following escape sequences are also allowed:
-
- | \a | -The bell character. | -- |
- | \f | -The form feed character. | -- |
- | \n | -The newline character. | -- |
- | \r | -The carriage return - character. | -- |
- | \t | -The tab character. | -- |
- | \v | -A vertical tab character. | -- |
- | \x | -A hexadecimal character - - for example \x0D. | -- |
- | \x{} | -A possible unicode - hexadecimal character - for example \x{1A0} | -- |
- | \cx | -The ASCII escape character - x, for example \c@ is equivalent to escape-@. | -- |
- | \e | -The ASCII escape character. | -- |
- | \dd | -An octal character constant, - for example \10. | -- |
-
Perl format strings
- -Perl format strings are the same as the default syntax except -that the characters ()?: have no special meaning.
- -Sed format strings
- -Sed format strings use only the characters \ and & as -special characters.
- -\n where n is a digit, is expanded to the nth sub-expression.
- -& is expanded to the whole of the match (equivalent to \0). -
- -Other escape sequences are expanded as per the default syntax.
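A short sketch of the sed rules above, assuming the standard library's `std::regex_replace` with its `std::regex_constants::format_sed` flag as a stand-in for regex_merge with boost::format_sed. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: \n expands to the nth sub-expression, so a
// year-month-day date is rewritten as day/month/year.
inline std::string reorder_date(const std::string& text)
{
    static const std::regex e("(\\d+)-(\\d+)-(\\d+)");
    return std::regex_replace(text, e, "\\3/\\2/\\1",
                              std::regex_constants::format_sed);
}

// Hypothetical helper: & expands to the whole of the match.
inline std::string bracket_match(const std::string& text)
{
    static const std::regex e("a+");
    return std::regex_replace(text, e, "<&>",
                              std::regex_constants::format_sed);
}
```

Note that the backslashes in the format string are doubled for the same reason as in the expression itself: the C++ compiler consumes one level of escaping first.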
-
-
-
-
diff --git a/hl_ref.htm b/hl_ref.htm
deleted file mode 100644
index 44b803a1..00000000
--- a/hl_ref.htm
+++ /dev/null
@@ -1,572 +0,0 @@
-
Regex++, RegEx Class Reference.
-Copyright (c) 1998-2001 Dr John Maddock
-
#include <boost/cregex.hpp>
- -The class RegEx provides a high level simplified interface to -the regular expression library, this class only handles narrow -character strings, and regular expressions always follow the -"normal" syntax - that is the same as the standard -POSIX extended syntax, but with locale specific collation -disabled, and escape characters inside character set declarations -are allowed.
-
-typedef bool (*GrepCallback)(const RegEx& expression);
-typedef bool (*GrepFileCallback)(const char* file, const RegEx& expression);
-typedef bool (*FindFilesCallback)(const char* file);
-
-class RegEx
-{
-public:
-   RegEx();
-   RegEx(const RegEx& o);
-   ~RegEx();
-   RegEx(const char* c, bool icase = false);
-   explicit RegEx(const std::string& s, bool icase = false);
-   RegEx& operator=(const RegEx& o);
-   RegEx& operator=(const char* p);
-   RegEx& operator=(const std::string& s);
-   unsigned int SetExpression(const char* p, bool icase = false);
-   unsigned int SetExpression(const std::string& s, bool icase = false);
-   std::string Expression()const;
-   //
-   // now matching operators:
-   //
-   bool Match(const char* p, unsigned int flags = match_default);
-   bool Match(const std::string& s, unsigned int flags = match_default);
-   bool Search(const char* p, unsigned int flags = match_default);
-   bool Search(const std::string& s, unsigned int flags = match_default);
-   unsigned int Grep(GrepCallback cb, const char* p, unsigned int flags = match_default);
-   unsigned int Grep(GrepCallback cb, const std::string& s, unsigned int flags = match_default);
-   unsigned int Grep(std::vector<std::string>& v, const char* p, unsigned int flags = match_default);
-   unsigned int Grep(std::vector<std::string>& v, const std::string& s, unsigned int flags = match_default);
-   unsigned int Grep(std::vector<unsigned int>& v, const char* p, unsigned int flags = match_default);
-   unsigned int Grep(std::vector<unsigned int>& v, const std::string& s, unsigned int flags = match_default);
-   unsigned int GrepFiles(GrepFileCallback cb, const char* files, bool recurse = false, unsigned int flags = match_default);
-   unsigned int GrepFiles(GrepFileCallback cb, const std::string& files, bool recurse = false, unsigned int flags = match_default);
-   unsigned int FindFiles(FindFilesCallback cb, const char* files, bool recurse = false, unsigned int flags = match_default);
-   unsigned int FindFiles(FindFilesCallback cb, const std::string& files, bool recurse = false, unsigned int flags = match_default);
-   std::string Merge(const std::string& in, const std::string& fmt, bool copy = true, unsigned int flags = match_default);
-   std::string Merge(const char* in, const char* fmt, bool copy = true, unsigned int flags = match_default);
-   unsigned Split(std::vector<std::string>& v, std::string& s, unsigned flags = match_default, unsigned max_count = ~0);
-   //
-   // now operators for returning what matched in more detail:
-   //
-   unsigned int Position(int i = 0)const;
-   unsigned int Length(int i = 0)const;
-   bool Matched(int i = 0)const;
-   unsigned int Line()const;
-   unsigned int Marks() const;
-   std::string What(int i)const;
-   std::string operator[](int i)const ;
-
-   static const unsigned int npos;
-};
-
-
Member functions for class RegEx are defined as follows:
-
- | RegEx(); | -Default constructor, - constructs an instance of RegEx without any valid - expression. | -- |
- | RegEx(const - RegEx& o); | -Copy constructor, all the - properties of parameter o are copied. | -- |
- | RegEx(const char* - c, bool icase = false); | -Constructs an instance of - RegEx, setting the expression to c, if icase - is true then matching is insensitive to case, - otherwise it is sensitive to case. Throws bad_expression - on failure. | -- |
- | RegEx(const std::string& - s, bool icase = false); | -Constructs an instance of - RegEx, setting the expression to s, if icase is - true then matching is insensitive to case, - otherwise it is sensitive to case. Throws bad_expression - on failure. | -- |
- | RegEx& operator=(const - RegEx& o); | -Default assignment operator. | -- |
- | RegEx& operator=(const - char* p); | -Assignment operator, - equivalent to calling SetExpression(p, false). - Throws bad_expression on failure. | -- |
- | RegEx& operator=(const - std::string& s); | -Assignment operator, - equivalent to calling SetExpression(s, false). - Throws bad_expression on failure. | -- |
- | unsigned int - SetExpression(const char* p, bool icase = false); | -Sets the current expression - to p, if icase is true then matching - is insensitive to case, otherwise it is sensitive to case. - Throws bad_expression on failure. | -- |
- | unsigned int - SetExpression(const std::string& s, bool - icase = false); | -Sets the current expression - to s, if icase is true then matching - is insensitive to case, otherwise it is sensitive to case. - Throws bad_expression on failure. | -- |
- | std::string Expression()const; | -Returns a copy of the - current regular expression. | -- |
- | bool Match(const - char* p, unsigned int flags = - match_default); | -Attempts to match the - current expression against the text p using the - match flags flags - see match flags. - Returns true if the expression matches the whole - of the input string. | -- |
- | bool Match(const - std::string& s, unsigned int flags = - match_default) ; | -Attempts to match the - current expression against the text s using the - match flags flags - see match flags. - Returns true if the expression matches the whole - of the input string. | -- |
- | bool Search(const - char* p, unsigned int flags = - match_default); | -Attempts to find a match for - the current expression somewhere in the text p - using the match flags flags - see match flags. - Returns true if the match succeeds. | -- |
- | bool Search(const - std::string& s, unsigned int flags = - match_default) ; | -Attempts to find a match for - the current expression somewhere in the text s - using the match flags flags - see match flags. - Returns true if the match succeeds. | -- |
- | unsigned int - Grep(GrepCallback cb, const char* p, unsigned - int flags = match_default); | -Finds all matches of the
- current expression in the text p using the match
- flags flags - see match flags.
- For each match found calls the call-back function cb
- as: cb(*this); If at any stage the call-back function - returns false then the grep operation terminates, - otherwise continues until no further matches are found. - Returns the number of matches found. - |
- - |
- | unsigned int - Grep(GrepCallback cb, const std::string& s, unsigned - int flags = match_default); | -Finds all matches of the
- current expression in the text s using the match
- flags flags - see match flags.
- For each match found calls the call-back function cb
- as: cb(*this); If at any stage the call-back function - returns false then the grep operation terminates, - otherwise continues until no further matches are found. - Returns the number of matches found. - |
- - |
- | unsigned int - Grep(std::vector<std::string>& v, const char* - p, unsigned int flags = match_default); | -Finds all matches of the - current expression in the text p using the match - flags flags - see match flags. - For each match pushes a copy of what matched onto v. - Returns the number of matches found. | -- |
- | unsigned int - Grep(std::vector<std::string>& v, const - std::string& s, unsigned int flags = - match_default); | -Finds all matches of the - current expression in the text s using the match - flags flags - see match flags. - For each match pushes a copy of what matched onto v. - Returns the number of matches found. | -- |
- | unsigned int - Grep(std::vector<unsigned int>& v, const - char* p, unsigned int flags = - match_default); | -Finds all matches of the - current expression in the text p using the match - flags flags - see match flags. - For each match pushes the starting index of what matched - onto v. Returns the number of matches found. | -- |
- | unsigned int - Grep(std::vector<unsigned int>& v, const - std::string& s, unsigned int flags = - match_default); | -Finds all matches of the - current expression in the text s using the match - flags flags - see match flags. - For each match pushes the starting index of what matched - onto v. Returns the number of matches found. | -- |
- | unsigned int - GrepFiles(GrepFileCallback cb, const char* - files, bool recurse = false, unsigned - int flags = match_default); | -Finds all matches of the
- current expression in the files files using the
- match flags flags - see match flags.
- For each match calls the call-back function cb. If - the call-back returns false then the algorithm returns - without considering further matches in the current file, - or any further files. -The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. -Returns the total number of matches found. -May throw an exception derived from std::runtime_error - if file io fails. - |
- - |
- | unsigned int - GrepFiles(GrepFileCallback cb, const std::string& - files, bool recurse = false, unsigned - int flags = match_default); | -Finds all matches of the
- current expression in the files files using the
- match flags flags - see match flags.
- For each match calls the call-back function cb. If - the call-back returns false then the algorithm returns - without considering further matches in the current file, - or any further files. -The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. -Returns the total number of matches found. -May throw an exception derived from std::runtime_error - if file io fails. - |
- - |
- | unsigned int - FindFiles(FindFilesCallback cb, const char* - files, bool recurse = false, unsigned - int flags = match_default); | -Searches files to
- find all those which contain at least one match of the
- current expression using the match flags flags -
- see match
- flags. For each matching file calls the call-back
- function cb. If the call-back returns false then - the algorithm returns without considering any further - files. -The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. -Returns the total number of files found. -May throw an exception derived from std::runtime_error - if file io fails. - |
- - |
- | unsigned int - FindFiles(FindFilesCallback cb, const std::string& - files, bool recurse = false, unsigned - int flags = match_default); | -Searches files to
- find all those which contain at least one match of the
- current expression using the match flags flags -
- see match
- flags. For each matching file calls the call-back
- function cb. If the call-back returns false then - the algorithm returns without considering any further - files. -The parameter files can include wild card - characters '*' and '?', if the parameter recurse - is true then searches sub-directories for matching file - names. -Returns the total number of files found. -May throw an exception derived from std::runtime_error - if file io fails. - |
- - |
- | std::string Merge(const - std::string& in, const std::string& fmt, bool - copy = true, unsigned int flags = - match_default); | -Performs a search and - replace operation: searches through the string in - for all occurrences of the current expression, for each - occurrence replaces the match with the format string fmt. - Uses flags to determine what gets matched, and how - the format string should be treated. If copy is - true then all unmatched sections of input are copied - unchanged to output, if the flag format_first_only - is set then only the first occurrence of the pattern found - is replaced. Returns the new string. See also format string - syntax, match - flags and format flags. | -- |
- | std::string Merge(const - char* in, const char* fmt, bool copy = true, - unsigned int flags = match_default); | -Performs a search and - replace operation: searches through the string in - for all occurrences of the current expression, for each - occurrence replaces the match with the format string fmt. - Uses flags to determine what gets matched, and how - the format string should be treated. If copy is - true then all unmatched sections of input are copied - unchanged to output, if the flag format_first_only - is set then only the first occurrence of the pattern found - is replaced. Returns the new string. See also format string - syntax, match - flags and format flags. | -- |
- | unsigned Split(std::vector<std::string>& - v, std::string& s, unsigned flags = - match_default, unsigned max_count = ~0); | -Splits the input string and pushes each - one onto the vector. If the expression contains no marked - sub-expressions, then one string is outputted for each - section of the input that does not match the expression. - If the expression does contain marked sub-expressions, - then outputs one string for each marked sub-expression - each time a match occurs. Outputs no more than max_count - strings. Before returning, deletes from the input - string s all of the input that has been processed - (all of the string if max_count was not reached). - Returns the number of strings pushed onto the vector. | -- |
- | unsigned int - Position(int i = 0)const; | -Returns the position of what - matched sub-expression i. If i = 0 then - returns the position of the whole match. Returns RegEx::npos - if the supplied index is invalid, or if the specified sub-expression - did not participate in the match. | -- |
- | unsigned int - Length(int i = 0)const; | -Returns the length of what - matched sub-expression i. If i = 0 then - returns the length of the whole match. Returns RegEx::npos - if the supplied index is invalid, or if the specified sub-expression - did not participate in the match. | -- |
- | bool Matched(int i - = 0)const; | -Returns true if sub-expression i was - matched, false otherwise. | -- |
- | unsigned int - Line()const; | -Returns the line on which - the match occurred, indexes start from 1 not zero, if no - match occurred then returns RegEx::npos. | -- |
- | unsigned int Marks() - const; | -Returns the number of marked - sub-expressions contained in the expression. Note that - this includes the whole match (sub-expression zero), so - the value returned is always >= 1. | -- |
- | std::string What(int - i)const; | -Returns a copy of what - matched sub-expression i. If i = 0 then - returns a copy of the whole match. Returns a null string - if the index is invalid or if the specified sub-expression - did not participate in a match. | -- |
- | std::string operator[](int - i)const ; | -Returns What(i). Can - be used to simplify access to sub-expression matches, and - make usage more perl-like. - |
- - |
-
-
diff --git a/index.htm b/index.htm
deleted file mode 100644
index f313dd7c..00000000
--- a/index.htm
+++ /dev/null
@@ -1,150 +0,0 @@
-
Regex++, Index.-(Version 3.31, 16th Dec 2001) - -Copyright (c) 1998-2001 -Dr John Maddock -Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty. - |
-
Copyright Dr -John Maddock 1998-2001 all rights reserved.
-
-
diff --git a/introduction.htm b/introduction.htm
deleted file mode 100644
index bcac99bb..00000000
--- a/introduction.htm
+++ /dev/null
@@ -1,476 +0,0 @@
-
Regex++, Introduction.
-Copyright (c) 1998-2001 Dr John Maddock
-
Regular expressions are a form of pattern matching that is -often used in text processing; many users will be familiar with -the Unix utilities grep, sed and awk, and -the programming language perl, each of which makes -extensive use of regular expressions. Traditionally C++ users -have been limited to the POSIX C APIs for manipulating regular -expressions, and while regex++ does provide these APIs, they do -not represent the best way to use the library. For example regex++ -can cope with wide character strings, or search and replace -operations (in a manner analogous to either sed or perl), -something that traditional C libraries cannot do.
- -The class boost::reg_expression -is the key class in this library; it represents a "machine -readable" regular expression, and is very closely modelled -on std::basic_string, think of it as a string plus the actual -state-machine required by the regular expression algorithms. Like -std::basic_string there are two typedefs that are almost always -the means by which this class is referenced:
-
-namespace boost{
-
-template <class charT,
-          class traits = regex_traits<charT>,
-          class Allocator = std::allocator<charT> >
-class reg_expression;
-
-typedef reg_expression<char> regex;
-typedef reg_expression<wchar_t> wregex;
-
-}
-
-
To see how this library can be used, imagine that we are -writing a credit card processing application. Credit card numbers -generally come as a string of 16-digits, separated into groups of -4-digits, and separated by either a space or a hyphen. Before -storing a credit card number in a database (not necessarily -something your customers will appreciate!), we may want to verify -that the number is in the correct format. To match any digit we -could use the regular expression [0-9], however ranges of -characters like this are actually locale dependent. Instead we -should use the POSIX standard form [[:digit:]], or the regex++ -and perl shorthand for this \d (note that many older libraries -tended to be hard-coded to the C-locale, consequently this was -not an issue for them). That leaves us with the following regular -expression to validate credit card number formats:
- -(\d{4}[- ]){3}\d{4}
- -Here the parentheses act to group (and mark for future -reference) sub-expressions, and the {4} means "repeat -exactly 4 times". This is an example of the extended regular -expression syntax used by perl, awk and egrep. Regex++ also -supports the older "basic" syntax used by sed and grep, -but this is generally less useful, unless you already have some -basic regular expressions that you need to reuse.
- -Now let's take that expression and place it in some C++ code to -validate the format of a credit card number:
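-
-#include <string>
-#include <boost/regex.hpp>
-
-bool validate_card_format(const std::string& s)
-{
-   // four groups of four digits, separated by a space or a hyphen
-   static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
-   return regex_match(s, e);
-}
-
-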
Note how we had to add some extra escapes to the expression: -remember that the escape is seen once by the C++ compiler, before -it gets to be seen by the regular expression engine, consequently -escapes in regular expressions have to be doubled up when -embedding them in C/C++ code. Also note that all the examples -assume that your compiler supports Koenig lookup, if yours -doesn't (for example VC6), then you will have to add some boost:: -prefixes to some of the function calls in the examples.
- -Those of you who are familiar with credit card processing, -will have realised that while the format used above is suitable -for human readable card numbers, it does not represent the format -required by online credit card systems; these require the number -as a string of 16 (or possibly 15) digits, without any -intervening spaces. What we need is a means to convert easily -between the two formats, and this is where search and replace -comes in. Those who are familiar with the utilities sed -and perl will already be ahead here; we need two strings - -one a regular expression - the other a "format string" that provides a -description of the text to replace the match with. In regex++ -this search and replace operation is performed with the algorithm -regex_merge, for our credit card example we can write two -algorithms like this to provide the format conversions:
-
-
-// match any format with the regular expression:
-const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
-// sed-style format strings: \N expands to marked sub-expression N
-const std::string machine_format("\\1\\2\\3\\4");
-const std::string human_format("\\1-\\2-\\3-\\4");
-
-std::string machine_readable_card_number(const std::string& s)
-{
-   return regex_merge(s, e, machine_format, boost::match_default | boost::format_sed);
-}
-
-std::string human_readable_card_number(const std::string& s)
-{
-   return regex_merge(s, e, human_format, boost::match_default | boost::format_sed);
-}
-
-
Here we've used marked sub-expressions in the regular -expression to split out the four parts of the card number as -separate fields, the format string then uses the sed-like syntax -to replace the matched text with the reformatted version.
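To see the two conversions on concrete input, here is a self-contained sketch under stated assumptions: it uses the standard library's `std::regex_replace` with `std::regex_constants::format_sed` in place of regex_merge, and anchors the expression with ^ and $ because the \A and \z assertions are not part of the standard library's default grammar. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumed stand-in for the expression above: ^ and $ replace \A and \z.
static const std::regex card_expression(
    "^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");

// Hypothetical helper: strips the separators, leaving only the digits.
inline std::string to_machine_format(const std::string& s)
{
    return std::regex_replace(s, card_expression, "\\1\\2\\3\\4",
                              std::regex_constants::format_sed);
}

// Hypothetical helper: rejoins the four fields with hyphens.
inline std::string to_human_format(const std::string& s)
{
    return std::regex_replace(s, card_expression, "\\1-\\2-\\3-\\4",
                              std::regex_constants::format_sed);
}
```

With this sketch "1234 5678 9012 3456" becomes "1234567890123456", and the reverse conversion yields "1234-5678-9012-3456". Input that does not match the expression is returned unchanged, so a caller that needs to reject bad numbers should validate first.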
- -In the examples above, we haven't directly manipulated the -results of a regular expression match, however in general the -result of a match contains a number of sub-expression matches in -addition to the overall match. When the library needs to report a -regular expression match it does so using an instance of the -class match_results, -as before there are typedefs of this class for the most common -cases:
-
-namespace boost{
-typedef match_results<const char*> cmatch;
-typedef match_results<const wchar_t*> wcmatch;
-typedef match_results<std::string::const_iterator> smatch;
-typedef match_results<std::wstring::const_iterator> wsmatch;
-}
-
-
The algorithms regex_search -and regex_grep (i.e. -finding all matches in a string) make use of match_results to -report what matched.
- -Note that these algorithms are not restricted to searching -regular C-strings, any bidirectional iterator type can be -searched, allowing for the possibility of seamlessly searching -almost any kind of data.
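As a sketch of what searching something other than a C-string looks like, the example below runs a search over a `std::list<char>`, which offers only bidirectional iterators. It assumes the standard library's `std::regex_search`, which accepts an iterator pair in the same spirit as the regex++ algorithm; the function name is hypothetical.

```cpp
#include <list>
#include <regex>

// Hypothetical helper: reports whether the character sequence contains
// a run of four digits. The list is searched in place through its
// bidirectional iterators; no copy into a string is made by the caller.
inline bool contains_four_digits(const std::list<char>& text)
{
    static const std::regex e("[0-9]{4}");
    return std::regex_search(text.begin(), text.end(), e);
}
```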
- -For search and replace operations in addition to the algorithm -regex_merge that -we have already seen, the algorithm regex_format takes -the result of a match and a format string, and produces a new -string by merging the two.
- -For those that dislike templates, there is a high level -wrapper class RegEx that is an encapsulation of the lower level -template code - it provides a simplified interface for those that -don't need the full power of the library, and supports only -narrow characters, and the "extended" regular -expression syntax.
- -The POSIX API functions: -regcomp, regexec, regfree and regerror, are available in both -narrow character and Unicode versions, and are provided for those -who need compatibility with these API's.
- -Finally, note that the library now has run-time localization support, and -recognizes the full POSIX regular expression syntax - including -advanced features like multi-character collating elements and -equivalence classes - as well as providing compatibility with -other regular expression libraries including GNU and BSD4 regex -packages, and to a more limited extent perl 5.
- -[ Important: If you are -upgrading from the 2.x version of this library then you will find -a number of changes to the documented header names and library -interfaces, existing code should still compile unchanged however -- see Note -for Upgraders. ]
- -When you extract the library from its zip file, you must -preserve its internal directory structure (for example by using -the -d option when extracting). If you didn't do that when -extracting, then you'd better stop reading this, delete the files -you just extracted, and try again!
- -This library should not need configuring before use; most -popular compilers/standard libraries/platforms are already -supported "as is". If you do experience configuration -problems, or just want to test the configuration with your -compiler, then the process is the same as for all of boost; see -the configuration library -documentation.
- -The library will encase all code inside namespace boost.
- -Unlike some other template libraries, this library consists of -a mixture of template code (in the headers) and static code and -data (in cpp files). Consequently it is necessary to build the -library's support code into a library or archive file before you -can use it, instructions for specific platforms are as follows:
- -Borland C++ Builder:
- -make -fbcb5.mak- -
The build process will build a variety of .lib and .dll files -(the exact number depends upon the version of Borland's tools you -are using) the .lib and dll files will be in a sub-directory -called bcb4 or bcb5 depending upon the makefile used. To install -the libraries into your development system use:
- -make -fbcb5.mak install
- -library files will be copied to <BCROOT>/lib and the -dll's to <BCROOT>/bin, where <BCROOT> corresponds to -the install path of your Borland C++ tools.
- -You may also remove temporary files created during the build -process (excluding lib and dll files) by using:
- -make -fbcb5.mak clean
- -Finally when you use regex++ it is only necessary for you to -add the <boost> root directory to your list of include -directories for that project. It is not necessary for you to -manually add a .lib file to the project; the headers will -automatically select the correct .lib file for your build mode -and tell the linker to include it. There is one caveat however: -the library can not tell the difference between VCL and non-VCL -enabled builds when building a GUI application from the command -line. If you build from the command line with the 5.5 command -line tools then you must define the pre-processor symbol _NO_VCL -in order to ensure that the correct link libraries are selected; -the C++ Builder IDE normally sets this automatically. Hint: users -of the 5.5 command line tools may want to add a -D_NO_VCL to bcc32.cfg -in order to set this option permanently.
- -If you would prefer to do a static link to the regex libraries -even when using the dll runtime then define -BOOST_REGEX_STATIC_LINK, and if you want to suppress automatic -linking altogether (and supply your own custom build of the lib) -then define BOOST_REGEX_NO_LIB.
- -If you are building with C++ Builder 6, you will find that -<boost/regex.hpp> can not be used in a pre-compiled header -(the actual problem is in <locale> which gets included by -<boost/regex.hpp>), if this causes problems for you, then -try defining BOOST_NO_STD_LOCALE when building, this will disable -some features throughout boost, but may save you a lot in compile -times!
- -Microsoft Visual C++ 6 and 7
- -You need version 6 of MSVC to build this library. If you are -using VC5 then you may want to look at one of the previous -releases of this library -
- -Open up a command prompt, which has the necessary MSVC -environment variables defined (for example by using the batch -file Vcvars32.bat installed by the Visual Studio installation), -and change to the <boost>\libs\regex\build directory.
- -Select the correct makefile - vc6.mak for "vanilla" -Visual C++ 6 or vc6-stlport.mak if you are using STLPort.
- -Invoke the makefile like this:
- -nmake -fvc6.mak
- -You will now have a collection of lib and dll files in a -"vc6" subdirectory, to install these into your -development system use:
- -nmake -fvc6.mak install
- -The lib files will be copied to your <VC6>\lib directory -and the dll files to <VC6>\bin, where <VC6> is the -root of your Visual C++ 6 installation.
- -You can delete all the temporary files created during the -build (excluding lib and dll files) using:
- -nmake -fvc6.mak clean
- -Finally when you use regex++ it is only necessary for you to -add the <boost> root directory to your list of include -directories for that project. It is not necessary for you to -manually add a .lib file to the project; the headers will -automatically select the correct .lib file for your build mode -and tell the linker to include it.
- -Note that if you want to statically link to the regex library -when using the dynamic C++ runtime, define -BOOST_REGEX_STATIC_LINK when building your project (this only has -an effect for release builds). If you want to add the source -directly to your project then define BOOST_REGEX_NO_LIB to -disable automatic library selection.
- -Important: there have been some -reports of compiler-optimisation bugs affecting this library (particularly -with VC6 versions prior to service pack 5); the workaround is to -build the library using /Oityb1 rather than /O2, that is, to use -all optimisation settings except /Oa. This problem is reported to -affect some standard library code as well (in fact I'm not sure -if the problem is with the regex code or the underlying standard -library), so it's probably worthwhile applying this workaround in -normal practice in any case.
- -Note: if you have replaced the C++ standard library that comes -with VC6, then when you build the library you must ensure that -the environment variables "INCLUDE" and "LIB" -have been updated to reflect the include and library paths for -the new library - see vcvars32.bat (part of your Visual Studio -installation) for more details. Alternatively if STLPort is in c:/stlport -then you could use:
- -nmake INCLUDES="-Ic:/stlport/stlport" XLFLAGS="/LIBPATH:c:/stlport/lib" --fvc6-stlport.mak
- -If you are building with the full STLPort v4.x, then use the
-vc6-stlport.mak file provided and set the environment variable
-STLPORT_PATH to point to the location of your STLport
-installation (Note that the full STLPort libraries appear not to
-support single-thread static builds).
-
-
GCC(2.95)
- -There is a conservative makefile for the g++ compiler. From -the command prompt change to the <boost>/libs/regex/build -directory and type:
- -make -fgcc.mak
- -At the end of the build process you should have a gcc sub-directory -containing release and debug versions of the library (libboost_regex.a -and libboost_regex_debug.a). When you build projects that use -regex++, you will need to add the boost install directory to your -list of include paths and add <boost>/libs/regex/build/gcc/libboost_regex.a -to your list of library files.
- -There is also a makefile to build the library as a shared -library:
- -make -fgcc-shared.mak
- -which will build libboost_regex.so and libboost_regex_debug.so.
- -Both of these makefiles support the following environment -variables:
- -CXXFLAGS: extra compiler options - note that this applies to -both the debug and release builds.
- -INCLUDES: additional include directories.
- -LDFLAGS: additional linker options.
- -LIBS: additional library files.
- -For the more adventurous there is a configure script in -<boost>/libs/config; see the config -library documentation.
- -Sun Workshop 6.1
- -There is a makefile for the sun (6.1) compiler (C++ version 3.12). -From the command prompt change to the <boost>/libs/regex/build -directory and type:
- -dmake -f sunpro.mak
- -At the end of the build process you should have a sunpro sub-directory -containing single and multithread versions of the library (libboost_regex.a, -libboost_regex.so, libboost_regex_mt.a and libboost_regex_mt.so). -When you build projects that use regex++, you will need to add -the boost install directory to your list of include paths and add -<boost>/libs/regex/build/sunpro/ to your library search -path.
- -This makefile supports the following environment -variables:
- -CXXFLAGS: extra compiler options - note that this applies to -both the single and multithreaded builds.
- -INCLUDES: additional include directories.
- -LDFLAGS: additional linker options.
- -LIBS: additional library files.
- -LIBSUFFIX: a suffix to mangle the library name with (defaults -to nothing).
- -This makefile does not set any architecture specific options -like -xarch=v9, you can set these by defining the appropriate -macros, for example:
- -dmake CXXFLAGS="-xarch=v9" LDFLAGS="-xarch=v9" -LIBSUFFIX="_v9" -f sunpro.mak
- -will build v9 variants of the regex library named -libboost_regex_v9.a etc.
- -Other compilers:
- -There is a generic makefile (generic.mak) -provided in <boost-root>/libs/regex/build - see that -makefile for details of environment variables that need to be set -before use. Alternatively you can use the Jam based build system. -If you need to configure the library for your platform, then -refer to the config library -documentation.
- -Copyright Dr -John Maddock 1998-2001 all rights reserved.
- - diff --git a/posix_ref.htm b/posix_ref.htm deleted file mode 100644 index ffe2e677..00000000 --- a/posix_ref.htm +++ /dev/null @@ -1,314 +0,0 @@ - - - - - - -- -
Regex++, POSIX API - Reference.-Copyright (c) 1998-2001 -Dr John Maddock -Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty. - |
-
#include <boost/cregex.hpp> -or: -#include <boost/regex.h>- -
The following functions are available for users who need a -POSIX compatible C library, they are available in both Unicode -and narrow character versions, the standard POSIX API names are -macros that expand to one version or the other depending upon -whether UNICODE is defined or not.
- -Important: Note that all the symbols defined here are -enclosed inside namespace boost when used in C++ programs, -unless you use #include <boost/regex.h> instead - in which -case the symbols are still defined in namespace boost, but are -made available in the global namespace as well.
- -The functions are defined as:
- -extern "C" {
-int regcompA(regex_tA*, const char*, int);
-unsigned int regerrorA(int, const regex_tA*, char*, unsigned int);
-int regexecA(const regex_tA*, const char*, unsigned int, regmatch_t*, int);
-void regfreeA(regex_tA*);
-
-int regcompW(regex_tW*, const wchar_t*, int);
-unsigned int regerrorW(int, const regex_tW*, wchar_t*, unsigned int);
-int regexecW(const regex_tW*, const wchar_t*, unsigned int, regmatch_t*, int);
-void regfreeW(regex_tW*);
-
-#ifdef UNICODE
-#define regcomp regcompW
-#define regerror regerrorW
-#define regexec regexecW
-#define regfree regfreeW
-#define regex_t regex_tW
-#else
-#define regcomp regcompA
-#define regerror regerrorA
-#define regexec regexecA
-#define regfree regfreeA
-#define regex_t regex_tA
-#endif
-}- -
All the functions operate on structure regex_t, which -exposes two public members:
- -unsigned int re_nsub this is filled in by regcomp -and indicates the number of sub-expressions contained in the -regular expression.
- -const TCHAR* re_endp points to the end of the -expression to compile when the flag REG_PEND is set.
- -Footnote: regex_t is actually a #define - it is either -regex_tA or regex_tW depending upon whether UNICODE is defined or -not, TCHAR is either char or wchar_t again depending upon the -macro UNICODE.
- -regcomp takes a pointer to a regex_t, a pointer
-to the expression to compile and a flags parameter which can be a
-combination of:
-
- | REG_EXTENDED | -Compiles modern regular - expressions. Equivalent to regbase::char_classes | - regbase::intervals | regbase::bk_refs. | -- |
- | REG_BASIC | -Compiles basic (obsolete) - regular expression syntax. Equivalent to regbase::char_classes - | regbase::intervals | regbase::limited_ops | regbase::bk_braces - | regbase::bk_parens | regbase::bk_refs. | -- |
- | REG_NOSPEC | -All characters are ordinary, - the expression is a literal string. | -- |
- | REG_ICASE | -Compiles for matching that - ignores character case. | -- |
- | REG_NOSUB | -Has no effect in this - library. | -- |
- | REG_NEWLINE | -When this flag is set a dot - does not match the newline character. | -- |
- | REG_PEND | -When this flag is set the - re_endp parameter of the regex_t structure must point to - the end of the regular expression to compile. | -- |
- | REG_NOCOLLATE | -When this flag is set then - locale dependent collation for character ranges is turned - off. | -- |
- | REG_ESCAPE_IN_LISTS | -When this flag is set, then - escape sequences are permitted in bracket expressions (character - sets). | -- |
- | REG_NEWLINE_ALT | -When this flag is set then - the newline character is equivalent to the alternation - operator |. | -- |
- | REG_PERL | -A shortcut for perl-like - behavior: REG_EXTENDED | REG_NOCOLLATE | - REG_ESCAPE_IN_LISTS | -- |
- | REG_AWK | -A shortcut for awk-like - behavior: REG_EXTENDED | REG_ESCAPE_IN_LISTS | -- |
- | REG_GREP | -A shortcut for grep like - behavior: REG_BASIC | REG_NEWLINE_ALT | -- |
- | REG_EGREP | -A shortcut for egrep - like behavior: REG_EXTENDED | REG_NEWLINE_ALT | -- |
-
regerror takes the following parameters, it maps an
-error code to a human readable string:
-
- | int code | -The error code. | -- |
- | const regex_t* e | -The regular expression (can - be null). | -- |
- | char* buf | -The buffer to fill in with - the error message. | -- |
- | unsigned int buf_size | -The length of buf. | -- |
If the error code is OR'ed with REG_ITOA then the message that -results is the printable name of the code rather than a message, -for example "REG_BADPAT". If the code is REG_ATOI then e -must not be null and e->re_endp must point to the -printable name of an error code; the return value is then the -value of the error code. For any other value of code, the -return value is the number of characters in the error message. If -the return value is greater than or equal to buf_size then -regerror will have to be called again with a larger buffer.
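The "call again with a larger buffer" rule can be sketched as a two-call idiom. This is a minimal sketch that assumes the system's POSIX `<regex.h>` as a stand-in for the regex++ wrappers; the helper name `compile_error_text` is hypothetical, and the system `regerror` reports the required size including the terminating null, which may differ slightly from how regex++ counts it.

```cpp
#include <regex.h>   // system POSIX header, assumed here in place of <boost/regex.h>
#include <cstddef>
#include <string>

// Hypothetical helper: compile `pattern` and return "" on success,
// or the human readable error message on failure.
std::string compile_error_text(const char* pattern)
{
    regex_t re;
    const int code = regcomp(&re, pattern, REG_EXTENDED);
    if (code == 0)
    {
        regfree(&re);   // release what a successful regcomp allocated
        return std::string();
    }
    // First call: a zero-length buffer only reports the size required.
    const std::size_t needed = regerror(code, &re, nullptr, 0);
    std::string message(needed, '\0');
    // Second call: the buffer is now large enough for the whole message.
    regerror(code, &re, &message[0], message.size());
    if (!message.empty())
        message.resize(message.size() - 1);   // drop the terminating null
    return message;
}
```

An unbalanced parenthesis such as "a(b" is one expression that should produce a non-empty message here.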
- -regexec finds the first occurrence of expression e
-within string buf; its parameters are, in order, e, buf,
-len, m and eflags. If len is non-zero then *m
-is filled in with what matched the regular expression: m[0]
-contains what the whole expression matched, m[1] what the first
-sub-expression matched, and so on; see regmatch_t in the header file
-declaration for more details. The eflags parameter can be a combination of:
-
-
- | REG_NOTBOL | -Parameter buf does - not represent the start of a line. | -- |
- | REG_NOTEOL | -Parameter buf does - not terminate at the end of a line. | -- |
- | REG_STARTEND | -The string searched starts - at buf + pmatch[0].rm_so and ends at buf + pmatch[0].rm_eo. | -- |
-
Finally regfree frees all the memory that was allocated -by regcomp.
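The compile, execute and free sequence described above might look like the following. It is a sketch only: it uses the system's POSIX `<regex.h>` rather than the regex++ header, the helper `posix_submatch` and its fixed limit of four match slots are assumptions made for illustration, and the expression in the usage note is a simplified form of the ftp response example.

```cpp
#include <regex.h>   // system POSIX header, assumed here in place of <boost/regex.h>
#include <string>

// Hypothetical helper: returns what sub-expression `index` matched in `text`,
// or "" if the expression does not compile or does not match.
std::string posix_submatch(const char* pattern, const char* text,
                           int index, int eflags = 0)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return std::string();

    const int max_groups = 4;   // room for m[0] plus three sub-expressions
    regmatch_t m[max_groups];
    std::string result;
    if (index >= 0 && index < max_groups
        && regexec(&re, text, max_groups, m, eflags) == 0
        && m[index].rm_so != -1)   // -1 marks a sub-expression that took no part
    {
        result.assign(text + m[index].rm_so, text + m[index].rm_eo);
    }
    regfree(&re);   // always pair a successful regcomp with regfree
    return result;
}
```

For example, `posix_submatch("^([0-9]+)(-| )(.*)$", "100- ok", 1)` yields the response code, while passing REG_NOTBOL as the last argument stops "^" from matching at the start of the text.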
- -Footnote: this is an abridged reference to the POSIX API
-functions; it is provided for compatibility with other libraries,
-rather than as an API to be used in new code (unless you need access
-from a language other than C++). This version of these functions
-should also happily coexist with other versions, as the names
-used are macros that expand to the actual function names.
-
Copyright Dr -John Maddock 1998-2000 all rights reserved.
- - diff --git a/syntax.htm b/syntax.htm deleted file mode 100644 index 327071e5..00000000 --- a/syntax.htm +++ /dev/null @@ -1,742 +0,0 @@ - - - - - - -- -
Regex++, Regular - Expression Syntax.-Copyright (c) 1998-2001 -Dr John Maddock -Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty. - |
-
This section covers the regular expression syntax used by this -library. This is a programmer's guide; the actual syntax presented -to your program's users will depend upon the flags used during -expression compilation.
- -Literals
- -All characters are literals except: ".", "|",
-"*", "?", "+", "(",
-")", "{", "}", "[",
-"]", "^", "$" and "\".
-These characters are literals when preceded by a "\". A
-literal is a character that matches itself, or matches the result
-of traits_type::translate(), where traits_type is the traits
-template parameter to class reg_expression.
-
-
Wildcard
- -The dot character "." matches any single character
-except : when match_not_dot_null is passed to the matching
-algorithms, the dot does not match a null character; when match_not_dot_newline
-is passed to the matching algorithms, then the dot does not match
-a newline character.
-
-
Repeats
- -A repeat is an expression that is repeated an arbitrary number -of times. An expression followed by "*" can be repeated -any number of times including zero. An expression followed by -"+" can be repeated any number of times, but at least -once, if the expression is compiled with the flag regbase::bk_plus_qm -then "+" is an ordinary character and "\+" -represents a repeat of once or more. An expression followed by -"?" may be repeated zero or one times only, if the -expression is compiled with the flag regbase::bk_plus_qm then -"?" is an ordinary character and "\?" -represents the repeat zero or once operator. When it is necessary -to specify the minimum and maximum number of repeats explicitly, -the bounds operator "{}" may be used, thus "a{2}" -is the letter "a" repeated exactly twice, "a{2,4}" -represents the letter "a" repeated between 2 and 4 -times, and "a{2,}" represents the letter "a" -repeated at least twice with no upper limit. Note that there must -be no white-space inside the {}, and there is no upper limit on -the values of the lower and upper bounds. When the expression is -compiled with the flag regbase::bk_braces then "{" and -"}" are ordinary characters and "\{" and -"\}" are used to delimit bounds instead. All repeat -expressions refer to the shortest possible previous sub-expression: -a single character; a character set, or a sub-expression grouped -with "()" for example.
- -Examples:
- -"ba*" will match all of "b", "ba", -"baaa" etc.
- -"ba+" will match "ba" or "baaaa" -for example but not "b".
- -"ba?" will match "b" or "ba".
- -"ba{2,4}" will match "baa", "baaa" -and "baaaa".
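The repeat examples above can be checked mechanically. The sketch below uses the standard library's `std::regex` (default ECMAScript grammar) as a stand-in, since it needs no separate build step; it is not regex++ itself, and the helper name `matches_whole` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `expression` matches all of `text`.
// std::regex is an assumed stand-in for regex++ here.
bool matches_whole(const std::string& expression, const std::string& text)
{
    return std::regex_match(text, std::regex(expression));
}
```

With this helper, "ba*" accepts "b" and "baaa", "ba+" rejects "b", and "ba{2,4}" accepts "baa" through "baaaa" but neither "ba" nor "baaaaa".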
- -Non-greedy repeats
- -Whenever the "extended" regular expression syntax is -in use (the default) then non-greedy repeats are possible by -appending a '?' after the repeat; a non-greedy repeat is one -which will match the shortest possible string.
- -For example to match html tag pairs one could use something -like:
- -"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>" -
- -In this case $1 will contain the text between the tag pairs,
-and will be the shortest possible matching string.
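To see the difference between a greedy and a non-greedy repeat, the sketch below captures the text between a pair of tags both ways. It assumes `std::regex` as a stand-in for regex++, and `first_capture` is a hypothetical helper that returns sub-expression 1 of the first match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns what sub-expression 1 matched in the first
// match of `expression` within `text`, or "" if nothing matched.
std::string first_capture(const std::string& expression, const std::string& text)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(expression)))
        return m[1].str();
    return std::string();
}
```

Given the input "<b>one</b> and <b>two</b>", the greedy "<b>(.*)</b>" captures everything up to the last closing tag, whereas "<b>(.*?)</b>" stops at the first one and captures only "one".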
-
-
Parenthesis
- -Parentheses serve two purposes, to group items together into a -sub-expression, and to mark what generated the match. For example -the expression "(ab)*" would match all of the string -"ababab". The matching algorithms regex_match and regex_search each -take an instance of match_results -that reports what caused the match, on exit from these functions -the match_results -contains information both on what the whole expression matched -and on what each sub-expression matched. In the example above -match_results[1] would contain a pair of iterators denoting the -final "ab" of the matching string. It is permissible -for sub-expressions to match null strings. If a sub-expression -takes no part in a match - for example if it is part of an -alternative that is not taken - then both of the iterators that -are returned for that sub-expression point to the end of the -input string, and the matched parameter for that sub-expression -is false. Sub-expressions are indexed from left to right -starting from 1, sub-expression 0 is the whole expression.
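The behaviour of marked sub-expressions can be sketched with `std::match_results`, which is assumed here as a stand-in for the regex++ class of the same name. The struct `capture_info` and the helper `capture` are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical record describing one sub-expression of a match.
struct capture_info
{
    bool matched;             // did the sub-expression take part in the match?
    std::ptrdiff_t position;  // offset of the sub-match, or -1
    std::string text;         // what the sub-expression matched
};

// Hypothetical helper: match all of `text` and report sub-expression `index`.
capture_info capture(const std::string& expression, const std::string& text,
                     std::size_t index)
{
    std::smatch m;
    capture_info info = { false, -1, std::string() };
    if (std::regex_match(text, m, std::regex(expression))
        && index < m.size() && m[index].matched)
    {
        info.matched = true;
        info.position = m.position(index);
        info.text = m[index].str();
    }
    return info;
}
```

For "(ab)*" against "ababab", sub-expression 1 reports the final "ab" at offset 4; for "(a)|(b)" against "b", sub-expression 1 is reported as not matched because its alternative was not taken.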
- -Non-Marking Parenthesis
- -Sometimes you need to group sub-expressions with parentheses, -but don't want the parentheses to generate another marked sub-expression; -in this case a non-marking parenthesis (?:expression) can be used. -For example the following expression creates no sub-expressions:
- -"(?:abc)*"
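One way to confirm that a non-marking group creates no sub-expression is to count the marked sub-expressions after compilation. The sketch assumes `std::regex`, whose `mark_count()` member reports that number, as a stand-in for regex++; `count_marked` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: number of marked sub-expressions in `expression`.
unsigned count_marked(const std::string& expression)
{
    return std::regex(expression).mark_count();
}
```

Here "(?:abc)*" reports zero marked sub-expressions while "(abc)*" reports one.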
- -Forward Lookahead Asserts
- -There are two forms of these; one for positive forward -lookahead asserts, and one for negative lookahead asserts:
- -"(?=abc)" matches zero characters only if they are -followed by the expression "abc".
- -"(?!abc)" matches zero characters only if they are -not followed by the expression "abc".
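Because a lookahead assert matches zero characters, the text it inspects never becomes part of the match. The following sketch shows this with `std::regex`, assumed as a stand-in for regex++; `matched_text` is a hypothetical helper, and "foo" is just an arbitrary prefix chosen for the illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match of `expression`
// in `text`, or "" if there is no match.
std::string matched_text(const std::string& expression, const std::string& text)
{
    std::smatch m;
    if (std::regex_search(text, m, std::regex(expression)))
        return m[0].str();
    return std::string();
}
```

"foo(?=abc)" finds "foo" in "fooabc" (and only "foo", not "fooabc"), but finds nothing in "fooxyz"; "foo(?!abc)" behaves the other way round.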
- -Alternatives
- -Alternatives occur when the expression can match either one -sub-expression or another, each alternative is separated by a -"|", or a "\|" if the flag regbase::bk_vbar -is set, or by a newline character if the flag regbase::newline_alt -is set. Each alternative is the largest possible previous sub-expression; -this is the opposite behaviour from repetition operators.
- -Examples:
- -"a(b|c)" could match "ab" or "ac". -
- -"abc|def" could match "abc" or "def".
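The scope of "|" is easy to misjudge, so a small check may help. This sketch again assumes `std::regex` as a stand-in for regex++, and `alternative_matches` is a hypothetical helper that tests a whole-string match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `expression` matches all of `text`.
bool alternative_matches(const std::string& expression, const std::string& text)
{
    return std::regex_match(text, std::regex(expression));
}
```

"abc|def" accepts "abc" or "def" but not "abdef", because each alternative extends over the whole of "abc" and "def"; to alternate only the middle letter, group it explicitly as in "ab(c|d)ef".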
-
-
-
Sets
- -A set is a set of characters that can match any single -character that is a member of the set. Sets are delimited by -"[" and "]" and can contain literals, -character ranges, character classes, collating elements and -equivalence classes. Set declarations that start with "^" -contain the complement of the elements that follow.
- -Examples:
- -Character literals:
- -"[abc]" will match either of "a", "b", -or "c".
- -"[^abc]" will match any character other than "a", -"b", or "c".
- -Character ranges:
- -"[a-z]" will match any character in the range "a" -to "z".
- -"[^A-Z]" will match any character other than those -in the range "A" to "Z".
- -Note that character ranges are highly locale dependent: they -match any character that collates between the endpoints of the -range, ranges will only behave according to ASCII rules when the -default "C" locale is in effect. For example if the -library is compiled with the Win32 localization model, then [a-z] -will match the ASCII characters a-z, and also 'A', 'B' etc, but -not 'Z' which collates just after 'z'. This locale specific -behaviour can be disabled by specifying regbase::nocollate when -compiling, this is the default behaviour when using regbase::normal, -and forces ranges to collate according to ASCII character code. -Likewise, if you use the POSIX C API functions then setting -REG_NOCOLLATE turns off locale dependent collation.
- -Character classes are denoted using the syntax "[:classname:]"
-within a set declaration, for example "[[:space:]]" is
-the set of all whitespace characters. Character classes are only
-available if the flag regbase::char_classes is set. The available
-character classes are:
-
- | alnum | -Any alpha numeric character. | -- |
- | alpha | -Any alphabetical character a-z - and A-Z. Other characters may also be included depending - upon the locale. | -- |
- | blank | -Any blank character, either - a space or a tab. | -- |
- | cntrl | -Any control character. | -- |
- | digit | -Any digit 0-9. | -- |
- | graph | -Any graphical character. | -- |
- | lower | -Any lower case character a-z. - Other characters may also be included depending upon the - locale. | -- |
- | print | -Any printable character. | -- |
- | punct | -Any punctuation character. | -- |
- | space | -Any whitespace character. | -- |
- | upper | -Any upper case character A-Z. - Other characters may also be included depending upon the - locale. | -- |
- | xdigit | -Any hexadecimal digit - character, 0-9, a-f and A-F. | -- |
- | word | -Any word character - all - alphanumeric characters plus the underscore. | -- |
- | unicode | -Any character whose code is - greater than 255, this applies to the wide character - traits classes only. | -- |
There are some shortcuts that can be used in place of the -character classes, provided the flag regbase::escape_in_lists is -set then you can use:
- -\w in place of [:word:]
- -\s in place of [:space:]
- -\d in place of [:digit:]
- -\l in place of [:lower:]
- -\u in place of [:upper:]
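As a rough illustration of character classes inside sets, the sketch below counts the runs of text that a set matches. It assumes `std::regex` in the default "C" locale as a stand-in for regex++; only the class names and the \d shortcut that `std::regex` is known to share with the list above are used, and `count_runs` is a hypothetical helper.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `expression`
// found in `text`, scanning from left to right.
std::ptrdiff_t count_runs(const std::string& expression, const std::string& text)
{
    const std::regex re(expression);   // must outlive the iterators below
    return std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                         std::sregex_iterator());
}
```

In "a1b22c333" both "[[:digit:]]+" and the shortcut form "[\d]+" find three runs of digits.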
-
-
Collating elements take the general form [.tagname.] inside a
-set declaration, where tagname is either a single
-character, or a name of a collating element, for example [[.a.]]
-is equivalent to [a], and [[.comma.]] is equivalent to [,]. The
-library supports all the standard POSIX collating element names,
-and in addition the following digraphs: "ae", "ch",
-"ll", "ss", "nj", "dz",
-"lj", each in lower, upper and title case variations.
-Multi-character collating elements can result in the set matching
-more than one character, for example [[.ae.]] would match two
-characters, but note that [^[.ae.]] would only match one
-character.
-
-
Equivalence classes take the general form [=tagname=] inside a
-set declaration, where tagname is either a single
-character, or a name of a collating element, and matches any
-character that is a member of the same primary equivalence class
-as the collating element [.tagname.]. An equivalence class is a
-set of characters that collate the same, a primary equivalence
-class is a set of characters whose primary sort keys are all the
-same (for example strings are typically collated by character,
-then by accent, and then by case; the primary sort key then
-relates to the character, the secondary to the accentuation, and
-the tertiary to the case). If there is no equivalence class
-corresponding to tagname, then [=tagname=] is exactly the
-same as [.tagname.]. Unfortunately there is no locale independent
-method of obtaining the primary sort key for a character, except
-under Win32. For other operating systems the library will "guess"
-the primary sort key from the full sort key (obtained from strxfrm),
-so equivalence classes are probably best considered broken under
-any operating system other than Win32.
-
-
To include a literal "-" in a set declaration then:
-make it the first character after the opening "[" or
-"[^", the endpoint of a range, a collating element, or
-if the flag regbase::escape_in_lists is set then precede with an
-escape character as in "[\-]". To include a literal
-"[" or "]" or "^" in a set then
-make them the endpoint of a range, a collating element, or
-precede with an escape character if the flag regbase::escape_in_lists
-is set.
-
-
Line anchors
- -An anchor is something that matches the null string at the
-start or end of a line: "^" matches the null string at
-the start of a line, "$" matches the null string at the
-end of a line.
-
-
Back references
- -A back reference is a reference to a previous sub-expression
-that has already been matched, the reference is to what the sub-expression
-matched, not to the expression itself. A back reference consists
-of the escape character "\" followed by a digit "1"
-to "9", "\1" refers to the first sub-expression,
-"\2" to the second etc. For example the expression
-"(.*)\1" matches any string that is repeated about its
-mid-point for example "abcabc" or "xyzxyz". A
-back reference to a sub-expression that did not participate in
-any match, matches the null string: NB this is different to some
-other regular expression matchers. Back references are only
-available if the expression is compiled with the flag regbase::bk_refs
-set.
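The "(.*)\1" example can be tried directly. The sketch assumes `std::regex`, whose default grammar also supports numbered back references, as a stand-in for regex++; `is_doubled` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `text` consists of some string
// immediately followed by a copy of itself.
bool is_doubled(const std::string& text)
{
    // \1 refers to what the first sub-expression matched, not to ".*" itself.
    static const std::regex doubled("(.*)\\1");
    return std::regex_match(text, doubled);
}
```

"abcabc" and "xyzxyz" are accepted; "abcabd" is rejected because the second half must repeat the exact text that sub-expression 1 matched.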
-
-
Characters by code
- -This is an extension to the algorithm that is not available in
-other libraries, it consists of the escape character followed by
-the digit "0" followed by the octal character code. For
-example "\023" represents the character whose octal
-code is 23. Where ambiguity could occur use parentheses to break
-the expression up: "\0103" represents the character
-whose code is 103, "(\010)3" represents the character 10
-followed by "3". To match characters by their
-hexadecimal code, use \x followed by a string of hexadecimal
-digits, optionally enclosed inside {}, for example \xf0 or
-\x{aff}, notice the latter example is a Unicode character.
-
-
Word operators
- -The following operators are provided for compatibility with -the GNU regular expression library.
- -"\w" matches any single character that is a member -of the "word" character class, this is identical to the -expression "[[:word:]]".
- -"\W" matches any single character that is not a -member of the "word" character class, this is identical -to the expression "[^[:word:]]".
- -"\<" matches the null string at the start of a -word.
- -"\>" matches the null string at the end of the -word.
- -"\b" matches the null string at either the start or -the end of a word.
- -"\B" matches a null string within a word.
- -The start of the sequence passed to the matching algorithms is
-considered to be a potential start of a word unless the flag
-match_not_bow is set. The end of the sequence passed to the
-matching algorithms is considered to be a potential end of a word
-unless the flag match_not_eow is set.
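A common use of the word operators is whole-word search. The sketch below uses "\b" and "\B" with `std::regex`, assumed as a stand-in for regex++ (the "\<" and "\>" forms are not used because `std::regex` does not provide them); the helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `word` occurs in `text` as a whole word.
// `word` is assumed to contain only word characters, so it is not escaped.
bool contains_word(const std::string& text, const std::string& word)
{
    return std::regex_search(text, std::regex("\\b" + word + "\\b"));
}

// Hypothetical helper: true when `word` occurs strictly inside a longer word.
bool inside_word(const std::string& text, const std::string& word)
{
    return std::regex_search(text, std::regex("\\B" + word + "\\B"));
}
```

"cat" is found as a whole word in "a cat sat" and in "cat" alone (the ends of the sequence count as word boundaries), but in "concatenate" it is only found by the "\B" form.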
-
-
Buffer operators
- -The following operators are provided for compatibility with the -GNU regular expression library, and Perl regular expressions:
- -"\`" matches the start of a buffer.
- -"\A" matches the start of the buffer.
- -"\'" matches the end of a buffer.
- -"\z" matches the end of a buffer.
- -"\Z" matches the end of a buffer, or possibly one or -more new line characters followed by the end of the buffer.
- -A buffer is considered to consist of the whole sequence passed
-to the matching algorithms, unless the flags match_not_bob or
-match_not_eob are set.
-
-
Escape operator
- -The escape character "\" has several meanings.
- -Inside a set declaration the escape character is a normal -character unless the flag regbase::escape_in_lists is set in -which case whatever follows the escape is a literal character -regardless of its normal meaning.
- -The escape operator may introduce an operator for example: -back references, or a word operator.
- -The escape operator may make the following character normal,
-for example "\*" represents a literal "*"
-rather than the repeat operator.
-
-
Single character escape sequences
- -The following escape sequences are aliases for single
-characters:
-
- | Escape sequence | -Character code | -Meaning | -- |
- | \a | -0x07 | -Bell character. | -- |
- | \f | -0x0C | -Form feed. | -- |
- | \n | -0x0A | -Newline character. | -- |
- | \r | -0x0D | -Carriage return. | -- |
- | \t | -0x09 | -Tab character. | -- |
- | \v | -0x0B | -Vertical tab. | -- |
- | \e | -0x1B | -ASCII Escape character. | -- |
- | \0dd | -0dd | -An octal character code, - where dd is one or more octal digits. | -- |
- | \xXX | -0xXX | -A hexadecimal character - code, where XX is one or more hexadecimal digits. | -- |
- | \x{XX} | -0xXX | -A hexadecimal character - code, where XX is one or more hexadecimal digits, - optionally a unicode character. | -- |
- | \cZ | -z-@ | -An ASCII escape sequence - control-Z, where Z is any ASCII character greater than or - equal to the character code for '@'. | -- |
-
Miscellaneous escape sequences:
- -The following are provided mostly for perl compatibility, but
-note that there are some differences in the meanings of \l \L \u
-and \U:
-
- | \w | -Equivalent to [[:word:]]. | -- |
- | \W | -Equivalent to [^[:word:]]. | -- |
- | \s | -Equivalent to [[:space:]]. | -- |
- | \S | -Equivalent to [^[:space:]]. | -- |
- | \d | -Equivalent to [[:digit:]]. | -- |
- | \D | -Equivalent to [^[:digit:]]. | -- |
- | \l | -Equivalent to [[:lower:]]. | -- |
- | \L | -Equivalent to [^[:lower:]]. | -- |
- | \u | -Equivalent to [[:upper:]]. | -- |
- | \U | -Equivalent to [^[:upper:]]. | -- |
- | \C | -Any single character, - equivalent to '.'. | -- |
- | \X | -Match any Unicode combining - character sequence, for example "a\x 0301" (a - letter a with an acute). | -- |
- | \Q | -The begin quote operator, - everything that follows is treated as a literal character - until a \E end quote operator is found. | -- |
- | \E | -The end quote operator, - terminates a sequence begun with \Q. | -- |
-
What gets matched?
- -The regular expression library will match the first possible
-matching string, if more than one string starting at a given
-location can match then it matches the longest possible string,
-unless the flag match_any is set, in which case the first match
-encountered is returned. Use of the match_any option can reduce
-the time taken to find the match - but is only useful if the user
-is less concerned about what matched - for example it would not
-be suitable for search and replace operations. In cases where
-there are multiple possible matches all starting at the same
-location, and all of the same length, then the match chosen is
-the one with the longest first sub-expression, if that is the
-same for two or more matches, then the second sub-expression will
-be examined and so on.
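The "first possible, then longest" rule is the POSIX leftmost-longest rule, and it can be observed with the system's POSIX `<regex.h>`, which is assumed here as a stand-in for regex++. The helper `posix_match` is hypothetical and returns the text of the overall match.

```cpp
#include <regex.h>   // system POSIX header, assumed here in place of <boost/regex.h>
#include <string>

// Hypothetical helper: returns what the whole expression matched in `text`,
// or "" if the expression does not compile or does not match.
std::string posix_match(const char* pattern, const char* text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return std::string();
    regmatch_t m[1];
    std::string result;
    if (regexec(&re, text, 1, m, 0) == 0)
        result.assign(text + m[0].rm_so, text + m[0].rm_eo);
    regfree(&re);
    return result;
}
```

Searching "xab" for "a|ab" returns "ab": both alternatives can start at the same position, so the longer one wins. Searching it for "b|ab" also returns "ab", this time because the match starting earliest in the text takes priority.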
-
Copyright Dr -John Maddock 1998-2000 all rights reserved.
- - diff --git a/template_class_ref.htm b/template_class_ref.htm deleted file mode 100644 index ccd0d3c9..00000000 --- a/template_class_ref.htm +++ /dev/null @@ -1,2479 +0,0 @@ - - - - - - -- -
Regex++ template - class reference.-Copyright (c) 1998-2001 -Dr John Maddock -Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty. - |
-
#include <boost/regex.hpp> -
- -Class regbase is the template-argument-independent base class -for reg_expression; its only public members are the flag_type -enumerated values that determine how regular expressions are -interpreted.
- -class regbase -{ -public: - enum flag_type_ - { - escape_in_lists = 1, // '\\' special inside [...] - char_classes = escape_in_lists << 1, // [[:CLASS:]] allowed - intervals = char_classes << 1, // {x,y} allowed - limited_ops = intervals << 1, // all of + ? and | are normal characters - newline_alt = limited_ops << 1, // \n is the same as | - bk_plus_qm = newline_alt << 1, // uses \+ and \? - bk_braces = bk_plus_qm << 1, // uses \{ and \} - bk_parens = bk_braces << 1, // uses \( and \) - bk_refs = bk_parens << 1, // \d allowed - bk_vbar = bk_refs << 1, // uses \| - use_except = bk_vbar << 1, // exception on error - failbit = use_except << 1, // error flag - literal = failbit << 1, // all characters are literals - icase = literal << 1, // characters are matched regardless of case - nocollate = icase << 1, // don't use locale specific collation - - basic = char_classes | intervals | limited_ops | bk_braces | bk_parens | bk_refs, - extended = char_classes | intervals | bk_refs, - normal = escape_in_lists | char_classes | intervals | bk_refs | nocollate, - emacs = bk_braces | bk_parens | bk_refs | bk_vbar, - awk = extended | escape_in_lists, - grep = basic | newline_alt, - egrep = extended | newline_alt, - sed = basic, - perl = normal - }; - typedef unsigned int flag_type; -};- -
-
-
The enumerated type regbase::flag_type determines the
-syntax rules for regular expression compilation; the various
-flags have the following effects:
-
- | regbase::escape_in_lists | -Allows the use of the escape - "\" character in sets of characters, for - example [\]] represents the set of characters containing - only "]". If this flag is not set then "\" - is an ordinary character inside sets. | -- |
- | regbase::char_classes | -When this bit is set, - character classes [:classname:] are allowed inside - character set declarations, for example "[[:word:]]" - represents the set of all characters that belong to the - character class "word". | -- |
- | regbase::intervals | -When this bit is set, - repetition intervals are allowed, for example "a{2,4}" - represents a repeat of between 2 and 4 letter a's. | -- |
- | regbase::limited_ops | -When this bit is set all of - "+", "?" and "|" are - ordinary characters in all situations. | -- |
- | regbase::newline_alt | -When this bit is set, then - the newline character "\n" has the same effect - as the alternation operator "|". | -- |
- | regbase::bk_plus_qm | -When this bit is set then - "\+" represents the one or more repetition - operator and "\?" represents the zero or one - repetition operator. When this bit is not set then - "+" and "?" are used instead. | -- |
- | regbase::bk_braces | -When this bit is set then - "\{" and "\}" are used for bounded - repetitions and "{" and "}" are - normal characters. This is the opposite of default - behaviour. | -- |
- | regbase::bk_parens | -When this bit is set then - "\(" and "\)" are used to group sub-expressions - and "(" and ")" are ordinary - characters. This is the opposite of default behaviour. | -- |
- | regbase::bk_refs | -When this bit is set then - back references are allowed. | -- |
- | regbase::bk_vbar | -When this bit is set then - "\|" represents the alternation operator and - "|" is an ordinary character. This is the - opposite of default behaviour. | -- |
- | regbase::use_except | -When this bit is set then a bad_expression exception will - be thrown on error. Use of this flag is deprecated; - reg_expression will always throw on error. | -- |
- | regbase::failbit | -This bit is set on error. If - regbase::use_except is not set, then this bit should be - checked to see if a regular expression is valid before - usage. | -- |
- | regbase::literal | -All characters in the string - are treated as literals, there are no special characters - or escape sequences. | -- |
- | regbase::icase | -All characters in the string - are matched regardless of case. | -- |
- | regbase::nocollate | -Locale specific collation is - disabled when dealing with ranges in character set - declarations. For example, when this bit is set the - expression [a-c] matches the characters a, b and c - only, regardless of locale, whereas when this bit is not - set, [a-c] matches any character that collates in the - range a to c. | -- |
- | regbase::basic | -Equivalent to the POSIX - basic regular expression syntax: char_classes | intervals - | limited_ops | bk_braces | bk_parens | bk_refs. | -- |
- | regbase::extended | -Equivalent to the POSIX - extended regular expression syntax: char_classes | - intervals | bk_refs. | -- |
- | regbase::normal | -This is the - default setting, and represents how most people expect - the library to behave. Equivalent to the POSIX extended - syntax, but with locale specific collation disabled, and - escape characters inside set declarations enabled: - regbase::escape_in_lists | regbase::char_classes | - regbase::intervals | regbase::bk_refs | regbase::nocollate. | -- |
- | regbase::emacs | -Provides - compatibility with the emacs editor, equivalent to: - bk_braces | bk_parens | bk_refs | bk_vbar. | -- |
- | regbase::awk | -Provides - compatibility with the Unix utility Awk, the same as POSIX - extended regular expressions, but allows escapes inside - bracket-expressions (character sets). Equivalent to - extended | escape_in_lists. | -- |
- | regbase::grep | -Provides - compatibility with the Unix grep utility, the same as - POSIX basic regular expressions, but with the newline - character equivalent to the alternation operator. The - same as basic | newline_alt. | -- |
- | regbase::egrep | -Provides - compatibility with the Unix egrep utility, the same as - POSIX extended regular expressions, but with the newline - character equivalent to the alternation operator. The - same as extended | newline_alt. | -- |
- | regbase::sed | -Provides - compatibility with the Unix sed utility, the same as POSIX - basic regular expressions. | -- |
- | regbase::perl | -Provides - compatibility with the perl programming language, the - same as regbase::normal. | -- |
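Because each option occupies its own bit, the compound syntaxes are plain bitwise ORs and can be tested with a mask. The sketch below copies the first few enumerators from the regbase listing into a hypothetical namespace sketch so that it is self-contained; has_option is an illustrative helper, not a library function.

```cpp
// Mirrors the first few bit values from the regbase listing above so the
// example is self-contained; this is not the library header.
namespace sketch {
enum flag_type_ {
    escape_in_lists = 1,
    char_classes    = escape_in_lists << 1,
    intervals       = char_classes << 1,
    limited_ops     = intervals << 1,
    newline_alt     = limited_ops << 1,
    bk_plus_qm      = newline_alt << 1,
    bk_braces       = bk_plus_qm << 1,
    bk_parens       = bk_braces << 1,
    bk_refs         = bk_parens << 1,

    basic    = char_classes | intervals | limited_ops | bk_braces | bk_parens | bk_refs,
    extended = char_classes | intervals | bk_refs,
    grep     = basic | newline_alt,
    egrep    = extended | newline_alt
};
typedef unsigned int flag_type;

// Hypothetical helper: true if every bit of `option` is set in `flags`.
inline bool has_option(flag_type flags, flag_type option)
{
    return (flags & option) == option;
}
} // namespace sketch
```

With these definitions, sketch::has_option(sketch::grep, sketch::bk_parens) holds because grep includes basic, while the same test against sketch::extended fails because extended syntax uses unescaped parentheses.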
#include <boost/pat_except.hpp> -
- -An instance of bad_expression is thrown whenever a bad -regular expression is encountered.
- -namespace boost{ - -class bad_pattern : public std::runtime_error -{ -public: - explicit bad_pattern(const std::string& s) : std::runtime_error(s){}; -}; - -class bad_expression : public bad_pattern -{ -public: - bad_expression(const std::string& s) : bad_pattern(s) {} -}; - - -} // namespace boost- -
Footnotes: the class bad_pattern forms the base class -for all pattern-matching exceptions, of which bad_expression -is one. The choice of std::runtime_error as the base class -for bad_pattern is moot: depending upon how the library is -used, exceptions may be either logic errors (programmer supplied -expressions) or run time errors (user supplied expressions).
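A common use of the exception is to validate a user supplied expression before it is used. The sketch below is an assumption-laden stand-in: it uses std::regex from the C++ standard library rather than reg_expression, and std::regex reports errors with std::regex_error instead of bad_expression. Both derive from std::runtime_error, so the catch clause has the same shape. The function name is_valid_pattern is hypothetical.

```cpp
#include <regex>
#include <stdexcept>
#include <string>

// Returns true if `pattern` compiles. std::regex is used as a stand-in for
// reg_expression; it reports errors with std::regex_error, which (like
// bad_pattern above) derives from std::runtime_error.
bool is_valid_pattern(const std::string& pattern)
{
    try {
        std::regex re(pattern);
        (void)re; // only the side effect of compiling matters here
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Catching the base class keeps the calling code independent of which pattern-matching exception is thrown.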
- -#include <boost/regex.hpp> -
- -The template class reg_expression encapsulates regular -expression parsing and compilation. The class derives from class regbase and takes three template -parameters:
- -charT: determines the character type, i.e. -either char or wchar_t.
- -traits: determines the behaviour of the -character type, for example whether character matching is case -sensitive or not, and which character class names are recognized. -A default traits class is provided: regex_traits<charT>. -
- -Allocator: the allocator class used to allocate -memory by the class.
- -For ease of use there are two typedefs that define the two -standard reg_expression instances; unless you want to use -custom allocators, you won't need anything other than -these:
- -namespace boost{ -template <class charT, class traits = regex_traits<charT>, class Allocator = std::allocator<charT> > -class reg_expression; -typedef reg_expression<char> regex; -typedef reg_expression<wchar_t> wregex; -}- -
The definition of reg_expression follows: it is based -very closely on class basic_string, and fulfils the requirements -for a container of charT.
- -namespace boost{ -template <class charT, class traits = regex_traits<charT>, class Allocator = std::allocator<charT> > -class reg_expression : public regbase -{ -public: - // typedefs: - typedef charT char_type; - typedef traits traits_type; - // locale_type - // placeholder for actual locale type used by the - // traits class to localise *this. - typedef typename traits::locale_type locale_type; - // value_type - typedef charT value_type; - // reference, const_reference - typedef charT& reference; - typedef const charT& const_reference; - // iterator, const_iterator - typedef const charT* const_iterator; - typedef const_iterator iterator; - // difference_type - typedef typename Allocator::difference_type difference_type; - // size_type - typedef typename Allocator::size_type size_type; - // allocator_type - typedef Allocator allocator_type; - typedef Allocator alloc_type; - // flag_type - typedef boost::int_fast32_t flag_type; -public: - // constructors - explicit reg_expression(const Allocator& a = Allocator()); - explicit reg_expression(const charT* p, flag_type f = regbase::normal, const Allocator& a = Allocator()); - reg_expression(const charT* p1, const charT* p2, flag_type f = regbase::normal, const Allocator& a = Allocator()); - reg_expression(const charT* p, size_type len, flag_type f, const Allocator& a = Allocator()); - reg_expression(const reg_expression&); - template <class ST, class SA> - explicit reg_expression(const std::basic_string<charT, ST, SA>& p, flag_type f = regbase::normal, const Allocator& a = Allocator()); - template <class I> - reg_expression(I first, I last, flag_type f = regbase::normal, const Allocator& a = Allocator()); - ~reg_expression(); - reg_expression& operator=(const reg_expression&); - reg_expression& operator=(const charT* ptr); - template <class ST, class SA> - reg_expression& operator=(const std::basic_string<charT, ST, SA>& p); - // - // assign: - reg_expression& assign(const reg_expression& that); - reg_expression& 
assign(const charT* ptr, flag_type f = regbase::normal); - reg_expression& assign(const charT* first, const charT* last, flag_type f = regbase::normal); - template <class string_traits, class A> - reg_expression& assign( - const std::basic_string<charT, string_traits, A>& s, - flag_type f = regbase::normal); - template <class iterator> - reg_expression& assign(iterator first, - iterator last, - flag_type f = regbase::normal); - // - // allocator access: - Allocator get_allocator()const; - // - // locale: - locale_type imbue(locale_type l); - locale_type getloc()const; - // - // flags: - flag_type getflags()const; - // - // str: - std::basic_string<charT> str()const; - // - // begin, end: - const_iterator begin()const; - const_iterator end()const; - // - // swap: - void swap(reg_expression&)throw(); - // - // size: - size_type size()const; - // - // max_size: - size_type max_size()const; - // - // empty: - bool empty()const; - unsigned mark_count()const; - bool operator==(const reg_expression&)const; - bool operator<(const reg_expression&)const; -}; -} // namespace boost- -
Class reg_expression has the following public member functions:
-
-
- | reg_expression(Allocator a = - Allocator()); | -Constructs a default - instance of reg_expression without any expression. | -- |
- | reg_expression(charT* p, unsigned - f = regbase::normal, Allocator a = Allocator()); | -Constructs an instance - of reg_expression from the expression denoted by the null - terminated string p, using the flags f to - determine regular expression syntax. See class regbase for allowable flag values. | -- |
- | reg_expression(charT* p1, - charT* p2, unsigned f = regbase::normal, Allocator - a = Allocator()); | -Constructs an instance - of reg_expression from the expression denoted by the pair of - input-iterators p1 and p2, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. | -- |
- | reg_expression(charT* p, - size_type len, unsigned f, Allocator a = Allocator()); | -Constructs an instance - of reg_expression from the expression denoted by the - string p of length len, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. | -- |
- | template <class
- ST, class SA> - reg_expression(const std::basic_string<charT, - ST, SA>& p, boost::int_fast32_t f = regbase::normal, - const Allocator& a = Allocator()); |
- Constructs an instance
- of reg_expression from the expression denoted by the
- string p, using the flags f to determine
- regular expression syntax. See class regbase
- for allowable flag values. Note - this member may not - be available depending upon your compiler capabilities. - |
- - |
- | template <class I> - reg_expression(I first, I last, flag_type f = regbase::normal, - const Allocator& a = Allocator()); |
- Constructs an instance - of reg_expression from the expression denoted by the pair of - input-iterators first and last, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. | -- |
- | reg_expression(const - reg_expression&); | -Copy constructor - copies an - existing regular expression. | -- |
- | reg_expression& operator=(const - reg_expression&); | -Copies an existing regular - expression. | -- |
- | reg_expression& operator=(const - charT* ptr); | -Equivalent to assign(ptr); | -- |
- | template <class ST, class
- SA> reg_expression& operator=(const std::basic_string<charT, - ST, SA>& p); - |
- Equivalent to assign(p); | -- |
- | reg_expression& assign(const - reg_expression& that); | -Copies the regular - expression contained by that, throws bad_expression if that - does not contain a valid expression. Returns *this. | -- |
- | reg_expression& assign(const - charT* p, flag_type f = regbase::normal); | -Compiles a regular - expression from the expression denoted by the null - terminated string p, using the flags f to - determine regular expression syntax. See class regbase for allowable flag values. - Throws bad_expression if p - does not contain a valid expression. Returns *this. | -- |
- | reg_expression& assign(const - charT* first, const charT* last, flag_type f = - regbase::normal); | -Compiles a regular - expression from the expression denoted by the pair of - input-iterators first-last, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. - Throws bad_expression if first-last - does not contain a valid expression. Returns *this. | -- |
- | template <class
- string_traits, class A> - reg_expression& assign(const std::basic_string<charT, - string_traits, A>& s, flag_type f = regbase::normal); |
- Compiles a regular - expression from the expression denoted by the string s, - using the flags f to determine regular expression - syntax. See class regbase for - allowable flag values. Throws bad_expression - if s does not contain a valid expression. Returns - *this. | -- |
- | template <class
- iterator> - reg_expression& assign(iterator first, iterator last, - flag_type f = regbase::normal); |
- Compiles a regular - expression from the expression denoted by the pair of - input-iterators first-last, using the flags f - to determine regular expression syntax. See class regbase for allowable flag values. - Throws bad_expression if first-last - does not contain a valid expression. Returns *this. | -- |
- | Allocator get_allocator()const; | -Returns the allocator used - by the expression. | -- |
- | locale_type imbue(const - locale_type& l); | -Imbues the expression with - the specified locale, and invalidates the current - expression. May throw std::runtime_error if the call - results in an attempt to open a non-existent message - catalogue. | -- |
- | locale_type getloc()const; | -Returns the locale used by - the expression. | -- |
- | flag_type getflags()const; | -Returns the flags used to - compile the current expression. | -- |
- | std::basic_string<charT> - str()const; | -Returns the current - expression as a string. | -- |
- | const_iterator begin()const; | -Returns a pointer to the - first character of the current expression. | -- |
- | const_iterator end()const; | -Returns a pointer to the end - of the current expression. | -- |
- | size_type size()const; | -Returns the length of the - current expression. | -- |
- | size_type max_size()const; | -Returns the maximum length - of a regular expression text. | -- |
- | bool empty()const; | -Returns true if the object - contains no valid expression. | -- |
- | unsigned mark_count()const - ; | -Returns the number of sub-expressions - in the compiled regular expression. Note that this - includes the whole match (subexpression zero), so the - value returned is always >= 1. | -- |
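Several of these members (assign, mark_count, imbue, getloc, swap) survive in the standard library's std::basic_regex, so their use can be sketched with std::regex as a stand-in for reg_expression. This is an assumption made only so the example is self-contained. Note one difference: std::regex::mark_count() does not count sub-expression zero, whereas the mark_count() documented above does. The function name count_groups is hypothetical.

```cpp
#include <regex>
#include <string>

// Compiles `pattern` via assign() and returns the number of marked
// sub-expressions. std::regex::mark_count() does not count the whole
// match, unlike the mark_count() documented above.
unsigned count_groups(const std::string& pattern)
{
    std::regex re;                               // no expression yet
    re.assign(pattern, std::regex::ECMAScript);  // compile with explicit syntax flags
    return re.mark_count();
}
```

For the pattern "(a)(b(c))" this returns 3; under the convention documented above the corresponding value would be 4.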
#include <boost/regex/regex_traits.hpp> -
- -This is a preliminary version of the regular expression -traits class, and is subject to change.
- -The purpose of the traits class is to make it easier to -customise the behaviour of reg_expression and the -associated matching algorithms. Custom traits classes can handle -special character sets or define additional character classes; -for example, one could define [[:kanji:]] as the set of all (Unicode) -kanji characters. This library provides three traits classes and -a wrapper class regex_traits, which inherits from one of -these depending upon the default localisation model in use. Class -c_regex_traits encapsulates the global C locale, class w32_regex_traits -encapsulates the global Win32 locale (only available on Win32 -systems), and class cpp_regex_traits encapsulates the C++ -locale (only provided if std::locale is supported):
- -template <class charT> class c_regex_traits; -template<> class c_regex_traits<char> { /*details*/ }; -template<> class c_regex_traits<wchar_t> { /*details*/ }; - -template <class charT> class w32_regex_traits; -template<> class w32_regex_traits<char> { /*details*/ }; -template<> class w32_regex_traits<wchar_t> { /*details*/ }; - -template <class charT> class cpp_regex_traits; -template<> class cpp_regex_traits<char> { /*details*/ }; -template<> class cpp_regex_traits<wchar_t> { /*details*/ }; - -template <class charT> class regex_traits : public base_type { /*details*/ };- -
Where "base_type" defaults to w32_regex_traits -on Win32 systems, and c_regex_traits otherwise. The -default behaviour can be changed by defining one of -BOOST_REGEX_USE_C_LOCALE (forces use of c_regex_traits by -default), or BOOST_REGEX_USE_CPP_LOCALE (forces use of cpp_regex_traits -by default). Alternatively a specific traits class can be passed -to the reg_expression template.
- -The requirements for custom traits classes are documented separately here....
- -There is also an example of a custom traits class supplied by Christian Engström, -see iso8859_1_regex_traits.cpp -and iso8859_1_regex_traits.hpp, -see the -readme file for more details.
- -#include <boost/regex.hpp> -
- -Regular expressions are different from many simple pattern-matching -algorithms in that as well as finding an overall match they can -also produce sub-expression matches: each sub-expression being -delimited in the pattern by a pair of parentheses (...). There -has to be some method for reporting sub-expression matches back -to the user: this is achieved by defining a class match_results -that acts as an indexed collection of sub-expression matches, -each sub-expression match being contained in an object of type sub_match.
- -// -// class sub_match: -// denotes one sub-expression match. -// -template <class iterator> -struct sub_match -{ - typedef typename std::iterator_traits<iterator>::value_type value_type; - typedef typename std::iterator_traits<iterator>::difference_type difference_type; - typedef iterator iterator_type; - - iterator first; - iterator second; - bool matched; - - operator std::basic_string<value_type>()const; - - bool operator==(const sub_match& that)const; - bool operator !=(const sub_match& that)const; - difference_type length()const; -}; - -// -// class match_results: -// contains an indexed collection of matched sub-expressions. -// -template <class iterator, class Allocator = std::allocator<typename std::iterator_traits<iterator>::value_type > > -class match_results -{ -public: - typedef Allocator alloc_type; - typedef typename Allocator::template Rebind<iterator>::size_type size_type; - typedef typename std::iterator_traits<iterator>::value_type char_type; - typedef sub_match<iterator> value_type; - typedef typename std::iterator_traits<iterator>::difference_type difference_type; - typedef iterator iterator_type; - explicit match_results(const Allocator& a = Allocator()); - match_results(const match_results& m); - match_results& operator=(const match_results& m); - ~match_results(); - size_type size()const; - const sub_match<iterator>& operator[](int n) const; - Allocator allocator()const; - difference_type length(int sub = 0)const; - difference_type position(unsigned int sub = 0)const; - unsigned int line()const; - iterator line_start()const; - std::basic_string<char_type> str(int sub = 0)const; - void swap(match_results& that); - bool operator==(const match_results& that)const; - bool operator<(const match_results& that)const; -}; -typedef match_results<const char*> cmatch; -typedef match_results<const wchar_t*> wcmatch; -typedef match_results<std::string::const_iterator> smatch; -typedef match_results<std::wstring::const_iterator> wsmatch;- -
Class match_results is used for reporting what matched a
-regular expression. It is passed to the matching algorithms regex_match and regex_search,
-and is used by regex_grep to notify the
-callback function (or function object) what matched. Note that
-the default allocator parameter has been chosen to match the
-default allocator parameter to reg_expression. match_results has
-the following public member functions:
-
- | match_results(Allocator a = - Allocator()); | -Constructs an instance of - match_results, using allocator instance a. | -- |
- | match_results(const - match_results& m); | -Copy constructor. | -- |
- | match_results& operator=(const - match_results& m); | -Assignment operator. | -- |
- | const - sub_match<iterator>& operator[](size_type - n) const; | -Returns what matched, item 0 - represents the whole string, item 1 the first sub-expression - and so on. | -- |
- | Allocator& allocator()const; | -Returns the allocator used - by the class. | -- |
- | difference_type length(unsigned - int sub = 0); | -Returns the length of the - matched subexpression, defaults to the length of the - whole match, in effect this is equivalent to operator[](sub).second - - operator[](sub).first. | -- |
- | difference_type position(unsigned - int sub = 0); | -Returns the position of the - matched sub-expression, defaults to the position of the - whole match. The returned value is the position of the - match relative to the start of the string. | -- |
- | unsigned int - line()const; | -Returns the index of the - line on which the match occurred, indices start with 1, - not zero. Equivalent to the number of newline characters - prior to operator[](0).first plus one. | -- |
- | iterator line_start()const; | -Returns an iterator denoting - the start of the line on which the match occurred. | -- |
- | size_type size()const; | -Returns how many sub-expressions - are present in the match, including sub-expression zero (the - whole match). This is the case even if no matches were - found in the search operation - you must use the returned - value from regex_search / regex_match to determine whether - any match occurred. | -- |
-
The operator[] member function needs further explanation: it
-returns a const reference to a structure of type
-sub_match<iterator>, which has the following public members:
-
-
- | typedef typename - std::iterator_traits<iterator>::value_type - value_type; | -The type pointed to by the - iterators. | -- |
- | typedef typename - std::iterator_traits<iterator>::difference_type - difference_type; | -A type that represents the - difference between two iterators. | -- |
- | typedef iterator - iterator_type; | -The iterator type. | -- |
- | iterator first | -An iterator denoting the - position of the start of the match. | -- |
- | iterator second | -An iterator denoting the - position of the end of the match. | -- |
- | bool matched | -A Boolean value denoting - whether this sub-expression participated in the match. | -- |
- | difference_type length()const; | -Returns the length of the - sub-expression match. | -- |
- | operator std::basic_string<value_type> - ()const; | -Converts the sub-expression - match into an instance of std::basic_string<>. Note - that this member may be either absent, or present to a - more limited degree depending upon your compiler - capabilities. | -- |
Operator[] takes an integer as an argument that denotes the
-sub-expression for which to return information; the argument can
-take the following special values:
-
- | -2 | -Returns everything from the
- end of the match, to the end of the input string,
- equivalent to $' in perl. If this is a null string, then:
- first == second -And -matched == false. - |
- - |
- | -1 | -Returns everything from the
- start of the input string (or the end of the last match
- if this is a grep operation), to the start of this match.
- Equivalent to $` in perl. If this is a null string, then:
- first == second -And -matched == false. - |
- - |
- | 0 | -Returns the whole of what - matched, equivalent to $& in perl. The matched - parameter is always true. | -- |
- | 0 < N < size() | -Returns what matched sub-expression
- N, if this sub-expression did not participate in the
- match then matched == false -otherwise: -matched == true. - |
- - |
- | N < -2 or N >= size() | -Represents an out-of range
- non-existent sub-expression. Returns a "null"
- match in which first == last -And -matched == false. - |
- - |
Note that as well as being parameterised for an allocator, -match_results<> also takes an iterator type, this allows -any pair of iterators to be searched for a given regular -expression, provided the iterators have at least bi-directional -properties.
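The pieces reported through operator[] can be sketched with the standard library's std::match_results, used here as a stand-in for the class documented above. In that interface the text before and after the match is reached through prefix() and suffix() rather than the indices -1 and -2. The names Pieces and split_around_number are hypothetical.

```cpp
#include <regex>
#include <string>

struct Pieces {
    std::string before;        // text preceding the match (index -1 above)
    std::string whole;         // the whole match (index 0)
    std::string after;         // text following the match (index -2 above)
    bool second_group_matched; // did sub-expression 2 participate?
};

// Searches `text` for a run of digits optionally followed by 'z'.
// `text` must outlive the match_results object, which stores only iterators.
Pieces split_around_number(const std::string& text)
{
    static const std::regex re("([0-9]+)(z)?");
    std::smatch m;
    Pieces p = { "", "", "", false };
    if (std::regex_search(text, m, re)) {
        p.before = m.prefix().str();
        p.whole = m[0].str();
        p.after = m.suffix().str();
        p.second_group_matched = m[2].matched;
    }
    return p;
}
```

Copying the pieces into strings before returning avoids keeping iterators into text alive longer than the string itself.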
- -#include <boost/regex.hpp> -
- -The algorithm regex_match determines whether a given regular -expression matches a given sequence denoted by a pair of -bidirectional-iterators. Note that the result is true only if the -expression matches the whole of the input sequence; the main use -of this function is data input validation. The algorithm is -defined as follows:
- -template <class iterator, class Allocator, class charT, class traits, class Allocator2> -bool regex_match(iterator first, - iterator last, - match_results<iterator, Allocator>& m, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default);- -
The library also defines the following convenience versions, -which take either a const charT*, or a const std::basic_string<>& -in place of a pair of iterators [note - these versions may not be -available, or may be available in a more limited form, depending -upon your compiler's capabilities]:
- -template <class charT, class Allocator, class traits, class Allocator2> -bool regex_match(const charT* str, - match_results<const charT*, Allocator>& m, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default); - -template <class ST, class SA, class Allocator, class charT, class traits, class Allocator2> -bool regex_match(const std::basic_string<charT, ST, SA>& s, - match_results<typename std::basic_string<charT, ST, SA>::const_iterator, Allocator>& m, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default);- -
Finally there is a set of convenience versions that simply -return true or false and do not indicate what matched:
- -template <class iterator, class Allocator, class charT, class traits, class Allocator2> -bool regex_match(iterator first, - iterator last, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default); - -template <class charT, class Allocator, class traits, class Allocator2> -bool regex_match(const charT* str, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default); - -template <class ST, class SA, class Allocator, class charT, class traits, class Allocator2> -bool regex_match(const std::basic_string<charT, ST, SA>& s, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default);- -
The parameters for the main function version are as follows:
-
- | iterator first | -Denotes the start of the range to be matched. | -- |
- | iterator last | -Denotes the end of the range - to be matched. | -- |
- | match_results<iterator, - Allocator>& m | -An instance of match_results
- in which what matched will be reported. On exit if a
- match occurred then m[0] denotes the whole of the string
- that matched, m[0].first must be equal to first, m[0].second
- will be less than or equal to last. m[1] denotes the
- first subexpression m[2] the second subexpression and so
- on. If no match occurred then m[0].first = m[0].second =
- last. Note that since the match_results structure - stores only iterators, and not strings, the iterators/strings - passed to regex_match must be valid for as long as the - result is to be used. For that reason never pass - temporary string objects to regex_match. - |
- - |
- | const - reg_expression<charT, traits, Allocator2>& e | -Contains the regular - expression to be matched. | -- |
- | unsigned flags = - match_default | -Determines the semantics - used for matching, a combination of one or more match_flags enumerators. | -- |
regex_match returns false if no match occurs or true if it -does. A match only occurs if it starts at first and -finishes at last. Example: the following example -processes an ftp response:
- -#include <cstdlib> -#include <boost/regex.hpp> -#include <string> -#include <iostream> - -using namespace boost; - -regex expression("([0-9]+)(\\-| |$)(.*)"); - -// process_ftp: -// on success returns the ftp response code, and fills -// msg with the ftp response message. -int process_ftp(const char* response, std::string* msg) -{ - cmatch what; - if(regex_match(response, what, expression)) - { - // what[0] contains the whole string - // what[1] contains the response code - // what[2] contains the separator character - // what[3] contains the text message. - if(msg) - msg->assign(what[3].first, what[3].second); - return std::atoi(what[1].first); - } - // failure did not match - if(msg) - msg->erase(); - return -1; -}- -
The value of the flags parameter
-passed to the algorithm must be a combination of one or more of
-the following values:
-
- | match_default | -The default value, indicates - that first represents the start of a line, the - start of a buffer, and (possibly) the start of a word. - Also implies that last represents the end of a - line, the end of the buffer and (possibly) the end of a - word. Implies that a dot sub-expression "." - will match both the newline character and a null. | -- |
- | match_not_bol | -When this flag is set then first - does not represent the start of a new line. | -- |
- | match_not_eol | -When this flag is set then last - does not represent the end of a line. | -- |
- | match_not_bob | -When this flag is set then first - is not the beginning of a buffer. | -- |
- | match_not_eob | -When this flag is set then last - does not represent the end of a buffer. | -- |
- | match_not_bow | -When this flag is set then first - can never match the start of a word. | -- |
- | match_not_eow | -When this flag is set then last - can never match the end of a word. | -- |
- | match_not_dot_newline | -When this flag is set then a - dot expression "." can not match the newline - character. | -- |
- | match_not_dot_null | -When this flag is set then a - dot expression "." can not match a null - character. | -- |
- | match_prev_avail | -When this flag - is set, then *--first is a valid expression and - the flags match_not_bol and match_not_bow have no effect, - since the value of the previous character can be used to - check these. | -- |
- | match_any | -When this flag - is set, then the first string matched is returned, rather - than the longest possible match. This flag can - significantly reduce the time taken to find a match, but - what matches is undefined. | -- |
- | match_not_null | -When this flag - is set, then the expression will never match a null - string. | -- |
 - | match_continuous | -When this flag - is set, then during a grep operation, each successive - match must start from where the previous match finished. | -- |
- | match_partial | -When this flag - is set, the regex algorithms will report partial matches - that is - where one or more characters at the end of the text input - matched some prefix of the regular expression. | -- |
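The effect of a few of these flags can be sketched in code. The block below is only an illustration, and it makes one loud assumption: it uses the C++11 `std::regex` library (whose `std::regex_constants` carry the same flag names) rather than the regex++ headers described here, so that it builds with nothing but the standard library. The helper `found` is hypothetical and exists only for this example.

```cpp
#include <regex>
#include <string>

namespace rc = std::regex_constants;

// Hypothetical helper (not part of regex++): reports whether `pattern`
// is found anywhere in `text` when searched with the given match flags.
// Assumption: std::regex stands in for the regex++ classes.
bool found(const std::string& text, const std::string& pattern,
           rc::match_flag_type flags = rc::match_default)
{
    const std::regex e(pattern);
    // Flags are a bitmask, so several can be combined with operator|.
    return std::regex_search(text.begin(), text.end(), e, flags);
}
```

With the default flags `found("abc", "^a")` succeeds, whereas passing `rc::match_not_bol` makes the same search fail, because the start of the range no longer counts as the start of a line. Likewise `rc::match_not_eol` stops `"c$"` from matching, and `rc::match_continuous` restricts a search to matches that begin at the first character.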
- -
#include <boost/regex.hpp> -
- -The algorithm regex_search will search a range denoted by a -pair of bidirectional-iterators for a given regular expression. -The algorithm uses various heuristics to reduce the search time -by only checking for a match if a match could conceivably start -at that position. The algorithm is defined as follows:
- -template <class iterator, class Allocator, class charT, class traits, class Allocator2> -bool regex_search(iterator first, - iterator last, - match_results<iterator, Allocator>& m, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default);- -
The library also defines the following convenience versions,
-which take either a const charT*, or a const std::basic_string<>&
-in place of a pair of iterators [note - these versions may not be
-available, or may be available in a more limited form, depending
-upon your compiler's capabilities]:
- -template <class charT, class Allocator, class traits, class Allocator2> -bool regex_search(const charT* str, - match_results<const charT*, Allocator>& m, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default); - -template <class ST, class SA, class Allocator, class charT, class traits, class Allocator2> -bool regex_search(const std::basic_string<charT, ST, SA>& s, - match_results<typename std::basic_string<charT, ST, SA>::const_iterator, Allocator>& m, - const reg_expression<charT, traits, Allocator2>& e, - unsigned flags = match_default);- -
The parameters for the main function version are as follows:
-
- | iterator first | -The starting position of the - range to search. | -- |
- | iterator last | -The ending position of the - range to search. | -- |
- | match_results<iterator, - Allocator>& m | -An instance of match_results
- in which what matched will be reported. On exit if a
- match occurred then m[0] denotes the whole of the string
- that matched, m[0].first and m[0].second will be less
- than or equal to last. m[1] denotes the first sub-expression,
- m[2] the second sub-expression, and so on. If no match
- occurred then m[0].first = m[0].second = last. Note - that since the match_results structure stores only - iterators, and not strings, the iterators/strings passed - to regex_search must be valid for as long as the result - is to be used. For that reason never pass temporary - string objects to regex_search. - |
- - |
- | const - reg_expression<charT, traits, Allocator2>& e | -The regular expression to - search for. | -- |
- | unsigned flags = - match_default | -The flags that determine - what gets matched, a combination of one or more match_flags enumerators. | -- |
-
Example: the following example
-takes the contents of a file in the form of a string, and
-searches for all the C++ class declarations in the file. The code
-will work regardless of the way that std::string is implemented;
-for example, it could easily be modified to work with the SGI rope
-class, which uses a non-contiguous storage strategy.
- -#include <string> -#include <map> -#include <boost/regex.hpp> - -// purpose: -// takes the contents of a file in the form of a string -// and searches for all the C++ class definitions, storing -// their locations in a map of strings/int's -typedef std::map<std::string, int, std::less<std::string> > map_type; - -boost::regex expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\\{|:[^;\\{()]*\\{)"); - -void IndexClasses(map_type& m, const std::string& file) -{ - std::string::const_iterator start, end; - start = file.begin(); - end = file.end(); - boost::match_results<std::string::const_iterator> what; - unsigned int flags = boost::match_default; - while(regex_search(start, end, what, expression, flags)) - { - // what[0] contains the whole string - // what[5] contains the class name. - // what[6] contains the template specialisation if any. - // add class name and position to map: - m[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = - what[5].first - file.begin(); - // update search position: - start = what[0].second; - // update flags: - flags |= boost::match_prev_avail; - flags |= boost::match_not_bob; - } -} -- -
#include <boost/regex.hpp> -
- -Regex_grep allows you to search through a bidirectional-iterator -range and locate all the (non-overlapping) matches with a given -regular expression. The function is declared as:
- -template <class Predicate, class iterator, class charT, class traits, class Allocator> -unsigned int regex_grep(Predicate foo, - iterator first, - iterator last, - const reg_expression<charT, traits, Allocator>& e, - unsigned flags = match_default)- -
The library also defines the following convenience versions,
-which take either a const charT*, or a const std::basic_string<>&
-in place of a pair of iterators [note - these versions may not be
-available, or may be available in a more limited form, depending
-upon your compiler's capabilities]:
- -template <class Predicate, class charT, class Allocator, class traits> -unsigned int regex_grep(Predicate foo, - const charT* str, - const reg_expression<charT, traits, Allocator>& e, - unsigned flags = match_default); - -template <class Predicate, class ST, class SA, class Allocator, class charT, class traits> -unsigned int regex_grep(Predicate foo, - const std::basic_string<charT, ST, SA>& s, - const reg_expression<charT, traits, Allocator>& e, - unsigned flags = match_default);- -
The parameters for the primary version of regex_grep have the
-following meanings:
-
- | foo | -A predicate function object - or function pointer, see below for more information. | -- |
- | first | -The start of the range to - search. | -- |
- | last | -The end of the range to - search. | -- |
- | e | -The regular expression to - search for. | -- |
 - | flags | -The flags that determine how - matching is carried out, a combination of one or more match_flags - enumerators. | -- |
The algorithm finds all of the non-overlapping matches
-of the expression e. For each match it fills a match_results<iterator, Allocator>
-structure, which contains information on what matched, and calls
-the predicate foo, passing the match_results<iterator,
-Allocator> as a single argument. If the predicate returns
-true, then the grep operation continues; otherwise it terminates
-without searching for further matches. The function returns the
-number of matches found.
- -The general form of the predicate is:
- -struct grep_predicate -{ - bool operator()(const match_results<iterator_type, expression_type::alloc_type>& m); -};- -
For example the regular expression "a*b" would find -one match in the string "aaaaab" and two in the string -"aaabb".
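The counts quoted above can be checked with a small sketch. As an assumption made so that the code builds with the standard library alone, it uses the C++11 `std::regex` library and `std::sregex_iterator` rather than regex_grep itself; `grep_sketch` and `count_matches` are hypothetical names that only imitate the behaviour described in this section (call a predicate per non-overlapping match, stop when it returns false, return the number of matches found).

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for regex_grep, written against std::regex:
// calls pred(match) for each non-overlapping match, stops early when
// pred returns false, and returns the number of matches found.
template <class Predicate>
unsigned grep_sketch(Predicate pred, const std::string& text, const std::regex& e)
{
    unsigned count = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), e), end; it != end; ++it)
    {
        ++count;
        if (!pred(*it))
            break;
    }
    return count;
}

// Counts the matches of `pattern` in `text` using a predicate that
// always asks for the search to continue.
unsigned count_matches(const std::string& text, const std::string& pattern)
{
    const std::regex e(pattern);
    return grep_sketch([](const std::smatch&) { return true; }, text, e);
}
```

Here `count_matches("aaaaab", "a*b")` finds the single match "aaaaab", while `count_matches("aaabb", "a*b")` finds "aaab" followed by "b".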
-
-Remember that this algorithm can be used for a lot more than
-implementing a version of grep: the predicate can be and do
-anything that you want. A grep utility would output the results
-to the screen, another program could index a file based on a
-regular expression and store a set of bookmarks in a list, and a
-text file conversion utility would output to a file. The results of
-one regex_grep can even be chained into another regex_grep to
-create recursive parsers.
- -Example: -convert the example from regex_search to use regex_grep -instead:
-
-#include <string>
-#include <map>
-#include <boost/regex.hpp>
-
-// IndexClasses:
-// takes the contents of a file in the form of a string
-// and searches for all the C++ class definitions, storing
-// their locations in a map of strings/int's
-
-typedef std::map<std::string, int, std::less<std::string> > map_type;
-
-boost::regex expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?"
-                 "(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?[[:space:]]*)*(\\<\\w*\\>)"
-                 "[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\\{|:[^;\\{()]*\\{)");
-
-class IndexClassesPred
-{
-   map_type& m;
-   std::string::const_iterator base;
-public:
-   IndexClassesPred(map_type& a, std::string::const_iterator b) : m(a), base(b) {}
-   // library names are qualified with boost::, since there is no using directive here:
-   bool operator()(const boost::match_results<std::string::const_iterator, boost::regex::alloc_type>& what)
-   {
-      // what[0] contains the whole string
-      // what[5] contains the class name.
-      // what[6] contains the template specialisation if any.
-      // add class name and position to map:
-      m[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] =
-                what[5].first - base;
-      return true;
-   }
-};
-
-void IndexClasses(map_type& m, const std::string& file)
-{
-   std::string::const_iterator start, end;
-   start = file.begin();
-   end = file.end();
-   boost::regex_grep(IndexClassesPred(m, start), start, end, expression);
-}
-
Example: -Use regex_grep to call a global callback function:
-
-#include <string>
-#include <map>
-#include <boost/regex.hpp>
-
-// purpose:
-// takes the contents of a file in the form of a string
-// and searches for all the C++ class definitions, storing
-// their locations in a map of strings/int's
-
-typedef std::map<std::string, int, std::less<std::string> > map_type;
-
-boost::regex expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?(\\{|:[^;\\{()]*\\{)");
-
-map_type class_index;
-std::string::const_iterator base;
-
-bool grep_callback(const boost::match_results<std::string::const_iterator, boost::regex::alloc_type>& what)
-{
-   // what[0] contains the whole string
-   // what[5] contains the class name.
-   // what[6] contains the template specialisation if any.
-   // add class name and position to map:
-   class_index[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] =
-                what[5].first - base;
-   return true;
-}
-
-void IndexClasses(const std::string& file)
-{
-   std::string::const_iterator start, end;
-   start = file.begin();
-   end = file.end();
-   base = start;
-   // match_default is qualified, since there is no using directive here:
-   boost::regex_grep(grep_callback, start, end, expression, boost::match_default);
-}
-
-
Example:
-use regex_grep to call a class member function, using the standard
-library adapters std::mem_fun and std::bind1st to
-convert the member function into a predicate:
- -#include <string> -#include <map> -#include <boost/regex.hpp> -#include <functional> - -// purpose: -// takes the contents of a file in the form of a string -// and searches for all the C++ class definitions, storing -// their locations in a map of strings/int's - -typedef std::map<std::string, int, std::less<std::string> > map_type; - -class class_index -{ - boost::regex expression; - map_type index; - std::string::const_iterator base; - bool grep_callback(boost::match_results<std::string::const_iterator, boost::regex::alloc_type> what); -public: - void IndexClasses(const std::string& file); - class_index() - : index(), - expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?" - "(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?" - "[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?" - "(\\{|:[^;\\{()]*\\{)" - ){} -}; - -bool class_index::grep_callback(boost::match_results<std::string::const_iterator, boost::regex::alloc_type> what) -{ - // what[0] contains the whole string - // what[5] contains the class name. - // what[6] contains the template specialisation if any. - // add class name and position to map: - index[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = - what[5].first - base; - return true; -} - -void class_index::IndexClasses(const std::string& file) -{ - std::string::const_iterator start, end; - start = file.begin(); - end = file.end(); - base = start; - regex_grep(std::bind1st(std::mem_fun(&class_index::grep_callback), this), - start, - end, - expression); -} -- -
Finally, -C++ Builder users can use C++ Builder's closure type as a -callback argument:
- -#include <string> -#include <map> -#include <boost/regex.hpp> -#include <functional> - -// purpose: -// takes the contents of a file in the form of a string -// and searches for all the C++ class definitions, storing -// their locations in a map of strings/int's - -typedef std::map<std::string, int, std::less<std::string> > map_type; -class class_index -{ - boost::regex expression; - map_type index; - std::string::const_iterator base; - typedef boost::match_results<std::string::const_iterator, boost::regex::alloc_type> arg_type; - bool grep_callback(const arg_type& what); -public: - typedef bool (__closure* grep_callback_type)(const arg_type&); - void IndexClasses(const std::string& file); - class_index() - : index(), - expression("^(template[[:space:]]*<[^;:{]+>[[:space:]]*)?" - "(class|struct)[[:space:]]*(\\<\\w+\\>([[:blank:]]*\\([^)]*\\))?" - "[[:space:]]*)*(\\<\\w*\\>)[[:space:]]*(<[^;:{]+>[[:space:]]*)?" - "(\\{|:[^;\\{()]*\\{)" - ){} -}; - -bool class_index::grep_callback(const arg_type& what) -{ - // what[0] contains the whole string -// what[5] contains the class name. -// what[6] contains the template specialisation if any. -// add class name and position to map: -index[std::string(what[5].first, what[5].second) + std::string(what[6].first, what[6].second)] = - what[5].first - base; - return true; -} - -void class_index::IndexClasses(const std::string& file) -{ - std::string::const_iterator start, end; - start = file.begin(); - end = file.end(); - base = start; - class_index::grep_callback_type cl = &(this->grep_callback); - regex_grep(cl, - start, - end, - expression); -}- -
#include <boost/regex.hpp> -
- -The algorithm regex_format takes the results of a match and -creates a new string based upon a format string, -regex_format can be used for search and replace operations:
- -template <class OutputIterator, class iterator, class Allocator, class charT> -OutputIterator regex_format(OutputIterator out, - const match_results<iterator, Allocator>& m, - const charT* fmt, - unsigned flags = 0); - -template <class OutputIterator, class iterator, class Allocator, class charT> -OutputIterator regex_format(OutputIterator out, - const match_results<iterator, Allocator>& m, - const std::basic_string<charT>& fmt, - unsigned flags = 0);- -
The library also defines the following convenience variation
-of regex_format, which returns the result directly as a string,
-rather than outputting to an iterator [note - this version may
-not be available, or may be available in a more limited form,
-depending upon your compiler's capabilities]:
- -template <class iterator, class Allocator, class charT> -std::basic_string<charT> regex_format - (const match_results<iterator, Allocator>& m, - const charT* fmt, - unsigned flags = 0); - -template <class iterator, class Allocator, class charT> -std::basic_string<charT> regex_format - (const match_results<iterator, Allocator>& m, - const std::basic_string<charT>& fmt, - unsigned flags = 0);- -
Parameters to the main version of the function are passed as
-follows:
-
- | OutputIterator out | -An output iterator type, the - output string is sent to this iterator. Typically this - would be a std::ostream_iterator. | -- |
- | const - match_results<iterator, Allocator>& m | -An instance of - match_results<> obtained from one of the matching - algorithms above, and denoting what matched. | -- |
- | const charT* fmt | -A format string that - determines how the match is transformed into the new - string. | -- |
- | unsigned flags | -Optional flags which - describe how the format string is to be interpreted. | -- |
Format flags are defined as follows:
-
-
 - | format_all | -Enables all syntax options (perl-like - plus extensions). | -- |
- | format_sed | -Allows only a sed-like - syntax. | -- |
- | format_perl | -Allows only a perl-like - syntax. | -- |
- | format_no_copy | -Disables copying of - unmatched sections to the output string during regex_merge operations. | -- |
 - | format_first_only | -When this flag is set only the first occurrence will - be replaced (applies to regex_merge only). | -- |
-
The format string syntax (and available options) is described -more fully under format -strings.
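As a rough illustration of formatting a match, the sketch below builds a new string from the sub-expressions of a single match. It assumes the C++11 `std::regex` library, where the equivalent operation is the `format` member of `std::match_results`, rather than the regex_format function documented here; `format_sketch` is a hypothetical helper, and the format syntax shown (`$1`, and `\1` under `format_sed`) is that of the standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (assumes std::regex, not regex++): finds the first
// match of `pattern` in `text` and returns `fmt` expanded against it,
// or an empty string if nothing matched.
std::string format_sketch(const std::string& text, const std::string& pattern,
                          const std::string& fmt,
                          std::regex_constants::match_flag_type flags =
                              std::regex_constants::format_default)
{
    const std::regex e(pattern);
    std::smatch what;
    // `text` outlives `what`, so the iterators stored in the match stay valid.
    if (!std::regex_search(text, what, e))
        return std::string();
    return what.format(fmt, flags);
}
```

For instance, `format_sketch("born 1998-07", "([0-9]{4})-([0-9]{2})", "$2/$1")` swaps the two captured fields, and the same result is obtained with the sed-style format `"\\2/\\1"` when `std::regex_constants::format_sed` is passed.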
- -#include <boost/regex.hpp> -
-
-The algorithm regex_merge is a combination of regex_grep and regex_format.
-That is, it greps through the string finding all the matches to
-the regular expression; for each match it then calls regex_format to format the string and
-sends the result to the output iterator. Sections of text that do
-not match are copied to the output unchanged only if the flags
-parameter does not have the flag format_no_copy
-set. If the flag format_first_only is
-set then only the first occurrence is replaced rather than all
-occurrences.
- -template <class OutputIterator, class iterator, class traits, class Allocator, class charT> -OutputIterator regex_merge(OutputIterator out, - iterator first, - iterator last, - const reg_expression<charT, traits, Allocator>& e, - const charT* fmt, - unsigned int flags = match_default); - -template <class OutputIterator, class iterator, class traits, class Allocator, class charT> -OutputIterator regex_merge(OutputIterator out, - iterator first, - iterator last, - const reg_expression<charT, traits, Allocator>& e, - std::basic_string<charT>& fmt, - unsigned int flags = match_default);- -
The library also defines the following convenience variation
-of regex_merge, which returns the result directly as a string,
-rather than outputting to an iterator [note - this version may
-not be available, or may be available in a more limited form,
-depending upon your compiler's capabilities]:
- -template <class traits, class Allocator, class charT> -std::basic_string<charT> regex_merge(const std::basic_string<charT>& text, - const reg_expression<charT, traits, Allocator>& e, - const charT* fmt, - unsigned int flags = match_default); - -template <class traits, class Allocator, class charT> -std::basic_string<charT> regex_merge(const std::basic_string<charT>& text, - const reg_expression<charT, traits, Allocator>& e, - const std::basic_string<charT>& fmt, - unsigned int flags = match_default);- -
Parameters to the main version of the function are passed as
-follows:
-
- | OutputIterator out | -An output iterator type, the - output string is sent to this iterator. Typically this - would be a std::ostream_iterator. | -- |
- | iterator first | -The start of the range of - text to grep (bidirectional-iterator). | -- |
- | iterator last | -The end of the range of text - to grep (bidirectional-iterator). | -- |
- | const - reg_expression<charT, traits, Allocator>& e | -The expression to search for. | -- |
- | const charT* fmt | -The format string to be - applied to sections of text that match. | -- |
- | unsigned int - flags = match_default | -Flags which determine how - the expression is matched - see match_flags, - and how the format string is interpreted - see format_flags. | -- |
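The interaction of the format_no_copy and format_first_only flags with a search-and-replace can be sketched briefly. The code assumes the C++11 `std::regex` library, in which `std::regex_replace` plays the role of regex_merge and the two flags keep the same names; `merge_sketch` is a hypothetical wrapper written only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper (assumes std::regex, not regex++): replaces the
// matches of `pattern` in `text` with the expansion of `fmt`.
// Match flags and format flags share one bitmask, as in regex_merge.
std::string merge_sketch(const std::string& text, const std::string& pattern,
                         const std::string& fmt,
                         std::regex_constants::match_flag_type flags =
                             std::regex_constants::match_default)
{
    const std::regex e(pattern);
    return std::regex_replace(text, e, fmt, flags);
}
```

Given the text `"a1b22c"`, the pattern `"[0-9]+"` and the format `"#"`, the default flags give `"a#b#c"`; format_first_only gives `"a#b22c"`; format_no_copy drops the unmatched text and gives `"##"`; and both flags together give `"#"`.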
Example: the following example takes -C/C++ source code as input, and outputs syntax highlighted HTML -code.
- --#include <fstream> -#include <sstream> -#include <string> -#include <iterator> -#include <boost/regex.hpp> -#include <fstream> -#include <iostream> - -// purpose: -// takes the contents of a file and transform to -// syntax highlighted code in html format - -boost::regex e1, e2; -extern const char* expression_text; -extern const char* format_string; -extern const char* pre_expression; -extern const char* pre_format; -extern const char* header_text; -extern const char* footer_text; - -void load_file(std::string& s, std::istream& is) -{ - s.erase(); - s.reserve(is.rdbuf()->in_avail()); - char c; - while(is.get(c)) - { - if(s.capacity() == s.size()) - s.reserve(s.capacity() * 3); - s.append(1, c); - } -} - -int main(int argc, const char** argv) -{ - try{ - e1.assign(expression_text); - e2.assign(pre_expression); - for(int i = 1; i < argc; ++i) - { - std::cout << "Processing file " << argv[i] << std::endl; - std::ifstream fs(argv[i]); - std::string in; - load_file(in, fs); - std::string out_name(std::string(argv[i]) + std::string(".htm")); - std::ofstream os(out_name.c_str()); - os << header_text; - // strip '<' and '>' first by outputting to a - // temporary string stream - std::ostringstream t(std::ios::out | std::ios::binary); - std::ostream_iterator<char, char> oi(t); - boost::regex_merge(oi, in.begin(), in.end(), e2, pre_format); - // then output to final output stream - // adding syntax highlighting: - std::string s(t.str()); - std::ostream_iterator<char, char> out(os); - boost::regex_merge(out, s.begin(), s.end(), e1, format_string); - os << footer_text; - } - } - catch(...) 
- { return -1; } - return 0; -} - -extern const char* pre_expression = "(<)|(>)|\\r"; -extern const char* pre_format = "(?1<)(?2>)"; - - -const char* expression_text = // preprocessor directives: index 1 - "(^[[:blank:]]*#(?:[^\\\\\\n]|\\\\[^\\n[:punct:][:word:]]*[\\n[:punct:][:word:]])*)|" - // comment: index 2 - "(//[^\\n]*|/\\*.*?\\*/)|" - // literals: index 3 - "\\<([+-]?(?:(?:0x[[:xdigit:]]+)|(?:(?:[[:digit:]]*\\.)?[[:digit:]]+(?:[eE][+-]?[[:digit:]]+)?))u?(?:(?:int(?:8|16|32|64))|L)?)\\>|" - // string literals: index 4 - "('(?:[^\\\\']|\\\\.)*'|\"(?:[^\\\\\"]|\\\\.)*\")|" - // keywords: index 5 - "\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import" - "|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall" - "|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool" - "|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete" - "|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto" - "|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected" - "|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast" - "|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned" - "|using|virtual|void|volatile|wchar_t|while)\\>" - ; - -const char* format_string = "(?1<font color=\"#008040\">$&</font>)" - "(?2<I><font color=\"#000080\">$&</font></I>)" - "(?3<font color=\"#0000A0\">$&</font>)" - "(?4<font color=\"#0000FF\">$&</font>)" - "(?5<B>$&</B>)"; - -const char* header_text = "<HTML>\n<HEAD>\n" - "<TITLE>Auto-generated html formated source</TITLE>\n" - "<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=windows-1252\">\n" - "</HEAD>\n" - "<BODY LINK=\"#0000ff\" VLINK=\"#800080\" BGCOLOR=\"#ffffff\">\n" - "<P> </P>\n<PRE>"; - -const char* footer_text = "</PRE>\n</BODY>\n\n";- -
#include <boost/regex.hpp> -
- -Algorithm regex_split performs a similar operation to the perl -split operation, and comes in three overloaded forms:
- -template <class OutputIterator, class charT, class Traits1, class Alloc1, class Traits2, class Alloc2> -std::size_t regex_split(OutputIterator out, - std::basic_string<charT, Traits1, Alloc1>& s, - const reg_expression<charT, Traits2, Alloc2>& e, - unsigned flags, - std::size_t max_split); - -template <class OutputIterator, class charT, class Traits1, class Alloc1, class Traits2, class Alloc2> -std::size_t regex_split(OutputIterator out, - std::basic_string<charT, Traits1, Alloc1>& s, - const reg_expression<charT, Traits2, Alloc2>& e, - unsigned flags = match_default); - -template <class OutputIterator, class charT, class Traits1, class Alloc1> -std::size_t regex_split(OutputIterator out, - std::basic_string<charT, Traits1, Alloc1>& s);- -
Each version takes an output-iterator for output, and a string -for input. If the expression contains no marked sub-expressions, -then the algorithm writes one string onto the output-iterator for -each section of input that does not match the expression. If the -expression does contain marked sub-expressions, then each time a -match is found, one string for each marked sub-expression will be -written to the output-iterator. No more than max_split strings -will be written to the output-iterator. Before returning, all the -input processed will be deleted from the string s (if max_split -is not reached then all of s will be deleted). Returns -the number of strings written to the output-iterator. If the -parameter max_split is not specified then it defaults to -UINT_MAX. If no expression is specified, then it defaults to -"\s+", and splitting occurs on whitespace.
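The simplest case described above (no marked sub-expressions, one output string per unmatched section) can be approximated as follows. This is a sketch under stated assumptions: it uses the C++11 `std::regex` library and `std::sregex_token_iterator` instead of regex_split, it leaves the input string untouched rather than deleting the processed text, it has no max_split limit, and it skips empty tokens, which is a choice made for this example. `split_sketch` is a hypothetical name.

```cpp
#include <list>
#include <regex>
#include <string>

// Hypothetical approximation of regex_split using std::regex:
// returns the sections of `s` that lie between matches of `pattern`.
// The default pattern splits on runs of whitespace.
std::list<std::string> split_sketch(const std::string& s,
                                    const std::string& pattern = "\\s+")
{
    const std::regex e(pattern);
    std::list<std::string> out;
    // Sub-match index -1 selects the text between successive matches.
    std::sregex_token_iterator it(s.begin(), s.end(), e, -1), end;
    for (; it != end; ++it)
    {
        if (it->length() != 0)   // skip empty sections (choice made for this sketch)
            out.push_back(it->str());
    }
    return out;
}
```

Calling `split_sketch("one two\tthree")` yields the three tokens "one", "two" and "three"; passing a second argument such as `"[,;]"` splits on those separators instead.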
- -Example: -the following function will split the input string into a series -of tokens, and remove each token from the string s:
- -unsigned tokenise(std::list<std::string>& l, std::string& s) -{ - return boost::regex_split(std::back_inserter(l), s); -}- -
Example:
-the following short program will extract all of the URLs from an
-HTML file, and print them out to cout:
-
-#include <list>
-#include <string>
-#include <iterator>
-#include <fstream>
-#include <iostream>
-#include <boost/regex.hpp>
-
-boost::regex e("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"",
-               boost::regbase::normal | boost::regbase::icase);
-
-void load_file(std::string& s, std::istream& is)
-{
-   s.erase();
-   //
-   // attempt to grow string buffer to match file size,
-   // this doesn't always work...
-   s.reserve(is.rdbuf()->in_avail());
-   char c;
-   while(is.get(c))
-   {
-      // use logarithmic growth strategy, in case
-      // in_avail (above) returned zero:
-      if(s.capacity() == s.size())
-         s.reserve(s.capacity() * 3);
-      s.append(1, c);
-   }
-}
-
-
-int main(int argc, char** argv)
-{
-   std::string s;
-   std::list<std::string> l;
-
-   for(int i = 1; i < argc; ++i)
-   {
-      std::cout << "Finding URLs in " << argv[i] << ":" << std::endl;
-      s.erase();
-      std::ifstream is(argv[i]);
-      load_file(s, is);
-      // the expression has one marked sub-expression (the href value),
-      // so one string is written to l for each match:
-      boost::regex_split(std::back_inserter(l), s, e);
-      while(l.size())
-      {
-         s = *(l.begin());
-         l.pop_front();
-         std::cout << s << std::endl;
-      }
-   }
-   return 0;
-}
-
The match-flag match_partial
can be passed to the
-following algorithms: regex_match, regex_search, and regex_grep.
-When used it indicates that partial as well as full matches
-should be found. A partial match is one that matched one or more
-characters at the end of the text input, but did not match all of
-the regular expression (although it may have done so had more
-input been available). Partial matches are typically used when
-either validating data input (checking each character as it is
-entered on the keyboard), or when searching texts that are either
-too long to load into memory (or even into a memory mapped file),
-or are of indeterminate length (for example the source may be a
-socket or similar). Partial and full matches can be
-differentiated as shown in the following table (the variable M
-represents an instance of match_results<> as filled in by
-regex_match, regex_search or regex_grep):
-
- | Result | -M[0].matched | -M[0].first | -M[0].second | -
No match | -False | -Undefined | -Undefined | -Undefined | -
Partial match | -True | -False | -Start of partial match. | -End of partial match (end of - text). | -
Full match | -True | -True | -Start of full match. | -End of full match. | -
The following example tests
-whether the text could be a valid credit card number. As
-the user presses each key, the character entered is appended to
-the string being built up, which is then passed to is_possible_card_number.
-If this returns true then the text is a complete, validly formatted card number,
-so the user interface's OK button would be enabled. If it returns
-false, then this is not yet a valid card number, but could become one
-with more input, so the user interface would disable the OK
-button. Finally, if the procedure throws an exception, the input
-could never become a valid number, so the character just entered
-must be discarded and a suitable error indication displayed to
-the user.
#include <string>
-#include <iostream>
-#include <stdexcept>
-#include <boost/regex.hpp>
-
-boost::regex e("(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})");
-
-bool is_possible_card_number(const std::string& input)
-{
-   //
-   // return false for partial match, true for full match, or throw for
-   // impossible match based on what we have so far...
-   boost::match_results<std::string::const_iterator> what;
-   if(!boost::regex_match(input, what, e, boost::match_default | boost::match_partial))
-   {
-      // the input so far could not possibly be valid so reject it:
-      throw std::runtime_error("Invalid data entered - this could not possibly be a valid card number");
-   }
-   // OK so far so good, but have we finished?
-   if(what[0].matched)
-   {
-      // excellent, we have a result:
-      return true;
-   }
-   // what we have so far is only a partial match...
-   return false;
-}
-
In the following example, text
-input is taken from a stream containing an unknown amount of
-text; this example simply counts the number of html tags
-encountered in the stream. The text is loaded into a buffer and
-searched a part at a time. If a partial match is encountered,
-then the partial match is searched a second time as the start
-of the next batch of text:
-
-#include <iostream>
-#include <fstream>
-#include <sstream>
-#include <string>
-#include <cstring>
-#include <boost/regex.hpp>
-
-// match some kind of html tag:
-boost::regex e("<[^>]*>");
-// count how many:
-unsigned int tags = 0;
-// saved position of partial match:
-char* next_pos = 0;
-
-bool grep_callback(const boost::match_results<char*>& m)
-{
-   if(m[0].matched == false)
-   {
-      // save position and return:
-      next_pos = m[0].first;
-   }
-   else
-      ++tags;
-   return true;
-}
-
-void search(std::istream& is)
-{
-   char buf[4096];
-   next_pos = buf + sizeof(buf);
-   bool have_more = true;
-   while(have_more)
-   {
-      // how much do we copy forward from last try:
-      unsigned leftover = (buf + sizeof(buf)) - next_pos;
-      // and how much is left to fill:
-      unsigned size = next_pos - buf;
-      // copy forward whatever we have left (memmove rather than memcpy,
-      // because the two ranges overlap when the partial match is longer
-      // than half the buffer):
-      std::memmove(buf, next_pos, leftover);
-      // fill the rest from the stream:
-      unsigned read = is.readsome(buf + leftover, size);
-      // check to see if we've run out of text:
-      have_more = read == size;
-      // reset next_pos:
-      next_pos = buf + sizeof(buf);
-      // and then grep:
-      boost::regex_grep(grep_callback,
-                        buf,
-                        buf + read + leftover,
-                        e,
-                        boost::match_default | boost::match_partial);
-   }
-}
-
Copyright Dr -John Maddock 1998-2001 all rights reserved.
- - diff --git a/traits_class_ref.htm b/traits_class_ref.htm deleted file mode 100644 index 669f5a87..00000000 --- a/traits_class_ref.htm +++ /dev/null @@ -1,1016 +0,0 @@ - - - - - - - -Regex++, Traits Class - Reference.-Copyright (c) 1998-2001 -Dr John Maddock -Permission to use, copy, modify, - distribute and sell this software and its documentation - for any purpose is hereby granted without fee, provided - that the above copyright notice appear in all copies and - that both that copyright notice and this permission - notice appear in supporting documentation. Dr John - Maddock makes no representations about the suitability of - this software for any purpose. It is provided "as is" - without express or implied warranty. - |
-
This section describes the traits class requirements of the
-reg_expression template class. These requirements are somewhat
-complex (sorry), and subject to change as users ask for new
-features; however I will try to keep them stable for a while, and
-ideally the requirements should lessen rather than increase.
-
-The reg_expression traits classes encapsulate both the properties of a character type and the properties of the locale associated with that type. The associated locale may be defined at run-time (via std::locale), or hard-coded into the traits class and determined at compile time.
-
-The following example class illustrates the interface required by a "typical" traits class for use with class reg_expression:
-
-class mytraits
-{
-public:
-   typedef implementation_defined char_type;
-   typedef implementation_defined uchar_type;
-   typedef implementation_defined size_type;
-   typedef implementation_defined string_type;
-   typedef implementation_defined locale_type;
-   typedef implementation_defined uint32_t;
-   struct sentry
-   {
-      sentry(const mytraits&);
-      operator void*() { return this; }
-   };
-
-   enum char_syntax_type
-   {
-      syntax_char = 0,
-      syntax_open_bracket = 1,                  // (
-      syntax_close_bracket = 2,                 // )
-      syntax_dollar = 3,                        // $
-      syntax_caret = 4,                         // ^
-      syntax_dot = 5,                           // .
-      syntax_star = 6,                          // *
-      syntax_plus = 7,                          // +
-      syntax_question = 8,                      // ?
-      syntax_open_set = 9,                      // [
-      syntax_close_set = 10,                    // ]
-      syntax_or = 11,                           // |
-      syntax_slash = 12,                        // \ (backslash)
-      syntax_hash = 13,                         // #
-      syntax_dash = 14,                         // -
-      syntax_open_brace = 15,                   // {
-      syntax_close_brace = 16,                  // }
-      syntax_digit = 17,                        // 0-9
-      syntax_b = 18,                            // for \b
-      syntax_B = 19,                            // for \B
-      syntax_left_word = 20,                    // for \<
-      syntax_right_word = 21,                   // for \>
-      syntax_w = 22,                            // for \w
-      syntax_W = 23,                            // for \W
-      syntax_start_buffer = 24,                 // for \`
-      syntax_end_buffer = 25,                   // for \'
-      syntax_newline = 26,                      // for newline alt
-      syntax_comma = 27,                        // for {x,y}
-
-      syntax_a = 28,                            // for \a
-      syntax_f = 29,                            // for \f
-      syntax_n = 30,                            // for \n
-      syntax_r = 31,                            // for \r
-      syntax_t = 32,                            // for \t
-      syntax_v = 33,                            // for \v
-      syntax_x = 34,                            // for \xdd
-      syntax_c = 35,                            // for \cx
-      syntax_colon = 36,                        // for [:...:]
-      syntax_equal = 37,                        // for [=...=]
-
-      // perl ops:
-      syntax_e = 38,                            // for \e
-      syntax_l = 39,                            // for \l
-      syntax_L = 40,                            // for \L
-      syntax_u = 41,                            // for \u
-      syntax_U = 42,                            // for \U
-      syntax_s = 43,                            // for \s
-      syntax_S = 44,                            // for \S
-      syntax_d = 45,                            // for \d
-      syntax_D = 46,                            // for \D
-      syntax_E = 47,                            // for \Q\E
-      syntax_Q = 48,                            // for \Q\E
-      syntax_X = 49,                            // for \X
-      syntax_C = 50,                            // for \C
-      syntax_Z = 51,                            // for \Z
-      syntax_G = 52,                            // for \G
-      syntax_bang = 53,                         // reserved for future use '!'
-      syntax_and = 54,                          // reserved for future use '&'
-   };
-
-   enum{
-      char_class_none = 0,
-      char_class_alpha,
-      char_class_cntrl,
-      char_class_digit,
-      char_class_lower,
-      char_class_punct,
-      char_class_space,
-      char_class_upper,
-      char_class_xdigit,
-      char_class_blank,
-      char_class_unicode,
-      char_class_alnum,
-      char_class_graph,
-      char_class_print,
-      char_class_word
-   };
-
-   static size_t length(const char_type* p);
-   unsigned int syntax_type(size_type c)const;
-   char_type translate(char_type c, bool icase)const;
-   void transform(string_type& out, const string_type& in)const;
-   void transform_primary(string_type& out, const string_type& in)const;
-   bool is_separator(char_type c)const;
-   bool is_combining(char_type)const;
-   bool is_class(char_type c, uint32_t f)const;
-   int toi(char_type c)const;
-   int toi(const char_type*& first, const char_type* last, int radix)const;
-   uint32_t lookup_classname(const char_type* first, const char_type* last)const;
-   bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const;
-   locale_type imbue(locale_type l);
-   locale_type getloc()const;
-   std::string error_string(unsigned id)const;
-
-   mytraits();
-   ~mytraits();
-};
-
The member types required by a traits class are defined as
-follows:
-
- | Member name | Description |
- | char_type | The character type encapsulated by this traits class; it must be a POD type and be convertible to uchar_type. |
- | uchar_type | The unsigned type corresponding to char_type; it must be convertible to size_type. |
- | size_type | An unsigned integral type with at least as much precision as uchar_type. |
- | string_type | A type that offers the same facilities as std::basic_string<char_type>. It is used for collating elements and sort strings. If char_type has no locale-dependent collation (it is not a "character"), then it could be something simpler than std::basic_string. |
- | locale_type | A type that encapsulates the locale used by the traits class, probably std::locale, but it could be a platform-specific type, or a dummy type if per-instance locales are not supported by the traits class. |
- | uint32_t | An unsigned integral type with at least 32 bits of precision, used as a bitmask type for character classification. |
- | sentry | A class or struct type which is constructible from an instance of the traits class, and is convertible to void*. An instance of type sentry is constructed before each regular expression is compiled; it provides an opportunity to carry out prefix/suffix operations on the traits class. For example, a traits class that encapsulates the global locale can use this as an opportunity to synchronize with the global locale (by updating any cached data). |
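The sentry requirement described in the table can be sketched as follows. This is only an illustration under stated assumptions: the class name cached_locale_traits, the refresh() member, and the refresh_count() counter are hypothetical and are not part of the library; only the constructor-from-traits and conversion-to-void* shape comes from the requirements above.

```cpp
#include <locale>

// Hypothetical traits fragment: only the sentry mechanism is shown.
class cached_locale_traits
{
public:
    typedef std::locale locale_type;

    cached_locale_traits() : cached_(), refreshes_(0) {}

    // Re-read the global locale. A real traits class would also rebuild
    // any lookup tables that were derived from the old locale.
    void refresh() const
    {
        cached_ = std::locale();   // copy of the current global locale
        ++refreshes_;
    }

    struct sentry
    {
        // Constructed before an expression is compiled (prefix operation).
        sentry(const cached_locale_traits& t) { t.refresh(); }
        operator void*() { return this; }
    };

    locale_type getloc() const { return cached_; }
    unsigned refresh_count() const { return refreshes_; }

private:
    mutable locale_type cached_;
    mutable unsigned refreshes_;
};
```

Constructing a sentry from a traits object triggers exactly one resynchronization, so cached data is never older than the most recent compile.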
- The following member constants are used to represent the
-locale-independent syntax of a regular expression; the member
-function syntax_type returns one of these values, and is
-used to convert a locale-dependent regular expression into a
-locale-independent sequence of tokens.
-
- | Member constant | English language representation |
- | syntax_char | All non-special characters. |
- | syntax_open_bracket | ( |
- | syntax_close_bracket | ) |
- | syntax_dollar | $ |
- | syntax_caret | ^ |
- | syntax_dot | . |
- | syntax_star | * |
- | syntax_plus | + |
- | syntax_question | ? |
- | syntax_open_set | [ |
- | syntax_close_set | ] |
- | syntax_or | | (vertical bar) |
- | syntax_slash | \ |
- | syntax_hash | # |
- | syntax_dash | - |
- | syntax_open_brace | { |
- | syntax_close_brace | } |
- | syntax_digit | 0123456789 |
- | syntax_b | b |
- | syntax_B | B |
- | syntax_left_word | < |
- | syntax_right_word | > |
- | syntax_w | w |
- | syntax_W | W |
- | syntax_start_buffer | ` |
- | syntax_end_buffer | ' |
- | syntax_newline | \n |
- | syntax_comma | , |
- | syntax_a | a |
- | syntax_f | f |
- | syntax_n | n |
- | syntax_r | r |
- | syntax_t | t |
- | syntax_v | v |
- | syntax_x | x |
- | syntax_c | c |
- | syntax_colon | : |
- | syntax_equal | = |
- | syntax_e | e |
- | syntax_l | l |
- | syntax_L | L |
- | syntax_u | u |
- | syntax_U | U |
- | syntax_s | s |
- | syntax_S | S |
- | syntax_d | d |
- | syntax_D | D |
- | syntax_E | E |
- | syntax_Q | Q |
- | syntax_X | X |
- | syntax_C | C |
- | syntax_Z | Z |
- | syntax_G | G |
- | syntax_bang | ! |
- | syntax_and | & |
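As a rough sketch of how syntax_type might map locale-dependent characters onto these tokens, the fragment below uses a per-locale lookup table. The names syntax_table and make_table are hypothetical helpers invented for this illustration, and only a handful of the token values from the table above are reproduced.

```cpp
#include <map>

// A few of the token values from the table above.
enum char_syntax_type
{
    syntax_char = 0,
    syntax_open_bracket = 1,
    syntax_close_bracket = 2,
    syntax_star = 6,
    syntax_w = 22
};

// Hypothetical per-locale table from input character to token.
class syntax_table
{
public:
    void define(char c, char_syntax_type t) { table_[c] = t; }

    // Any character without an entry is an ordinary (non-special) one.
    unsigned int syntax_type(char c) const
    {
        std::map<char, char_syntax_type>::const_iterator i = table_.find(c);
        return i == table_.end() ? syntax_char : i->second;
    }

private:
    std::map<char, char_syntax_type> table_;
};

// Build a table in which word_letter is the letter used for the
// "word character" escape ('w' in English, 'm' in the French example).
inline syntax_table make_table(char word_letter)
{
    syntax_table t;
    t.define('(', syntax_open_bracket);
    t.define(')', syntax_close_bracket);
    t.define('*', syntax_star);
    t.define(word_letter, syntax_w);
    return t;
}
```

With make_table('m'), the letter m yields syntax_w while w is reported as an ordinary character, so the parser itself never needs to know which language the expression was written in.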
The following member constants are used to represent
-particular character classifications:
-
- | Member constant | Description |
- | char_class_none | No classification, must be zero. |
- | char_class_alpha | All alphabetic characters. |
- | char_class_cntrl | All control characters. |
- | char_class_digit | All decimal digits. |
- | char_class_lower | All lower case characters. |
- | char_class_punct | All punctuation characters. |
- | char_class_space | All white-space characters. |
- | char_class_upper | All upper case characters. |
- | char_class_xdigit | All hexadecimal digit characters. |
- | char_class_blank | All blank characters (space + tab). |
- | char_class_unicode | All extended Unicode characters: those that cannot be represented as a single narrow character. |
- | char_class_alnum | All alpha-numeric characters. |
- | char_class_graph | All graphic characters. |
- | char_class_print | All printable characters. |
- | char_class_word | All word characters (alphanumeric characters + the underscore). |
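Because these constants are combined and tested as a bitmask, one plausible arrangement is to give each primitive class its own bit and build composite classes as unions. The sketch below assumes exactly that; the bit values, the class_mask typedef, and the extra char_class_underscore bit are assumptions made for illustration, not values taken from the library. The free functions mirror the is_class and lookup_classname members for plain char.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

typedef unsigned long class_mask;   // at least 32 bits wide

// Assumed bit assignments: one bit per primitive class.
const class_mask char_class_none       = 0;
const class_mask char_class_alpha      = 1UL << 0;
const class_mask char_class_digit      = 1UL << 1;
const class_mask char_class_space      = 1UL << 2;
const class_mask char_class_underscore = 1UL << 3;  // hypothetical helper bit
// Composite classes are unions of the primitive bits.
const class_mask char_class_alnum = char_class_alpha | char_class_digit;
const class_mask char_class_word  = char_class_alnum | char_class_underscore;

// True if c belongs to at least one of the classes selected in f.
inline bool is_class(char c, class_mask f)
{
    const unsigned char uc = static_cast<unsigned char>(c);
    return ((f & char_class_alpha) && std::isalpha(uc))
        || ((f & char_class_digit) && std::isdigit(uc))
        || ((f & char_class_space) && std::isspace(uc))
        || ((f & char_class_underscore) && c == '_');
}

// Map the name in [first, last) to its mask, or char_class_none.
inline class_mask lookup_classname(const char* first, const char* last)
{
    struct entry { const char* name; class_mask mask; };
    static const entry names[] = {
        { "alpha", char_class_alpha },
        { "digit", char_class_digit },
        { "space", char_class_space },
        { "alnum", char_class_alnum },
        { "word",  char_class_word  }
    };
    const std::size_t len = static_cast<std::size_t>(last - first);
    for (std::size_t i = 0; i < sizeof(names) / sizeof(names[0]); ++i)
    {
        if (std::strlen(names[i].name) == len
            && std::strncmp(names[i].name, first, len) == 0)
            return names[i].mask;
    }
    return char_class_none;
}
```

Looking up "word" and passing the result to is_class accepts letters, digits and the underscore, and rejects white space.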
The following member functions are required by all regular
-expression traits classes. Members that are declared here
-as const could be declared static instead if the
-class does not contain instance data:
-
- | Member function | Description |
- | static size_t length(const char_type* p); | Returns the length of the null-terminated string p. |
- | unsigned int syntax_type(size_type c)const; | Converts an input character into a locale-independent token (one of the syntax_xxx member constants). Called when parsing the regular expression into a locale-independent parse tree. Example: in English language regular expressions we would use "[[:word:]]" to represent the character class of all word characters, and "\w" as a shortcut for this. Consequently syntax_type('w') returns syntax_w. In French language regular expressions, we would use "[[:mot:]]" in place of "[[:word:]]" and therefore "\m" in place of "\w", so it is syntax_type('m') that returns syntax_w. |
- | char_type translate(char_type c, bool icase)const; | Translates an input character into a unique identifier that represents the equivalence class that the character belongs to. If icase is true, then the returned value is insensitive to case. [An equivalence class is the set of all characters that must be treated as being equivalent to each other.] |
- | void transform(string_type& out, const string_type& in)const; | Transforms the string in into a locale-dependent sort key, and stores the result in out. |
- | void transform_primary(string_type& out, const string_type& in)const; | Transforms the string in into a locale-dependent primary sort key, and stores the result in out. |
- | bool is_separator(char_type c)const; | Returns true only if c is a line separator. |
- | bool is_combining(char_type c)const; | Returns true only if c is a Unicode combining character. |
- | bool is_class(char_type c, uint32_t f)const; | Returns true only if c is a member of one of the character classes represented by the bitmask f. |
- | int toi(char_type c)const; | Converts the character c to a decimal integer. [Precondition: is_class(c,char_class_digit)==true] |
- | int toi(const char_type*& first, const char_type* last, int radix)const; | Converts the string [first, last) into an integral value using base radix. Stops when it finds the first non-digit character, and sets first to point to that character. [Precondition: is_class(*first,char_class_digit)==true] |
- | uint32_t lookup_classname(const char_type* first, const char_type* last)const; | Returns the bitmask representing the character class named by [first, last), or char_class_none if [first, last) is not recognized as a character class name. |
- | bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const; | If the sequence [first, last) is the name of a known collating element, then stores the collating element in buf and returns true; otherwise returns false. |
- | locale_type imbue(locale_type l); | Imbues the class with the locale l. |
- | locale_type getloc()const; | Returns the traits-class locale. |
- | std::string error_string(unsigned id)const; | Returns the locale-dependent error string associated with the error number id. The parameter id is one of the REG_XXX error codes described by the POSIX standard, and defined in <boost/cregex.hpp>. |
- | mytraits(); | Constructor. |
- | ~mytraits(); | Destructor. |
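The three-argument toi is the least obvious entry in the table, since it both returns a value and advances its first argument. A minimal sketch for plain char, written as a free function, might look like the following. The helper digit_value is hypothetical, and the handling of digits outside the radix (stop, as for any non-digit) is an assumption rather than documented behaviour.

```cpp
// Hypothetical helper: numeric value of c, or -1 if c is not a digit
// in any base up to 16.
inline int digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Convert the digits at the start of [first, last) using base radix.
// On return, first points at the first character that was not consumed.
inline int toi(const char*& first, const char* last, int radix)
{
    int result = 0;
    while (first != last)
    {
        const int d = digit_value(*first);
        if (d < 0 || d >= radix)
            break;                    // not a digit in this base: stop here
        result = result * radix + d;
        ++first;
    }
    return result;
}
```

Taking first by reference lets a parser read a number and carry on from the character that follows it, for example when reading the two bounds of a {x,y} repeat.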
There is also an example of a custom traits class supplied by Christian Engström,
-see iso8859_1_regex_traits.cpp
-and iso8859_1_regex_traits.hpp.
-This example inherits from c_regex_traits and provides its own
-implementations of two locale-specific functions. This ensures
-that the class gives consistent behaviour (albeit tied to one
-locale) on all platforms. A fuller description by the author is
-available in the readme file.
-
Copyright Dr John Maddock 1998-2001, all rights reserved.
- -