This section covers the regular expression syntax used by this library. This is
a programmer's guide; the actual syntax presented to your program's users will
- depend upon the flags used during expression compilation.
-
-
Literals
-
-
All characters are literals except: ".", "|", "*", "?", "+", "(", ")", "{",
- "}", "[", "]", "^", "$" and "\". These characters are literals when preceded by
- a "\". A literal is a character that matches itself, or matches the result of
- traits_type::translate(), where traits_type is the traits template parameter to
- class basic_regex.
-
Wildcard
-
-
The dot character "." matches any single character except : when match_not_dot_null
- is passed to the matching algorithms, the dot does not match a null character;
- when match_not_dot_newline is passed to the matching algorithms, then
- the dot does not match a newline character.
-
-
Repeats
-
-
A repeat is an expression that is repeated an arbitrary number of times. An
- expression followed by "*" can be repeated any number of times including zero.
- An expression followed by "+" can be repeated any number of times, but at least
- once, if the expression is compiled with the flag regex_constants::bk_plus_qm
- then "+" is an ordinary character and "\+" represents a repeat of once or more.
- An expression followed by "?" may be repeated zero or one times only, if the
- expression is compiled with the flag regex_constants::bk_plus_qm then "?" is an
- ordinary character and "\?" represents the repeat zero or once operator. When
- it is necessary to specify the minimum and maximum number of repeats
- explicitly, the bounds operator "{}" may be used, thus "a{2}" is the letter "a"
- repeated exactly twice, "a{2,4}" represents the letter "a" repeated between 2
- and 4 times, and "a{2,}" represents the letter "a" repeated at least twice with
- no upper limit. Note that there must be no white-space inside the {}, and there
- is no upper limit on the values of the lower and upper bounds. When the
- expression is compiled with the flag regex_constants::bk_braces then "{" and
- "}" are ordinary characters and "\{" and "\}" are used to delimit bounds
- instead. All repeat expressions refer to the shortest possible previous
- sub-expression: a single character; a character set, or a sub-expression
- grouped with "()" for example.
-
-
Examples:
-
-
"ba*" will match all of "b", "ba", "baaa" etc.
-
-
"ba+" will match "ba" or "baaaa" for example but not "b".
-
-
"ba?" will match "b" or "ba".
-
-
"ba{2,4}" will match "baa", "baaa" and "baaaa".
-
-
Non-greedy repeats
-
-
Whenever the "extended" regular expression syntax is in use (the default) then
- non-greedy repeats are possible by appending a '?' after the repeat; a
- non-greedy repeat is one which will match the shortest possible string.
-
-
For example to match html tag pairs one could use something like:
-
-
"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>"
-
-
In this case $1 will contain the text between the tag pairs, and will be the
- shortest possible matching string.
-
-
Parenthesis
-
-
Parentheses serve two purposes, to group items together into a sub-expression,
- and to mark what generated the match. For example the expression "(ab)*" would
- match all of the string "ababab". The matching algorithms
- regex_match and regex_search each take
- an instance of match_results that reports what
- caused the match, on exit from these functions the match_results
- contains information both on what the whole expression matched and on what each
- sub-expression matched. In the example above match_results[1] would contain a
- pair of iterators denoting the final "ab" of the matching string. It is
- permissible for sub-expressions to match null strings. If a sub-expression
- takes no part in a match - for example if it is part of an alternative that is
- not taken - then both of the iterators that are returned for that
- sub-expression point to the end of the input string, and the matched parameter
- for that sub-expression is false. Sub-expressions are indexed from left
- to right starting from 1, sub-expression 0 is the whole expression.
-
-
Non-Marking Parenthesis
-
-
Sometimes you need to group sub-expressions with parenthesis, but don't want
- the parenthesis to spit out another marked sub-expression, in this case a
- non-marking parenthesis (?:expression) can be used. For example the following
- expression creates no sub-expressions:
-
-
"(?:abc)*"
-
Forward Lookahead Asserts
-
-
There are two forms of these; one for positive forward lookahead asserts, and
- one for negative lookahead asserts:
-
"(?=abc)" matches zero characters only if they are followed by the expression
- "abc".
-
"(?!abc)" matches zero characters only if they are not followed by the
- expression "abc".
-
Independent sub-expressions
-
"(?>expression)" matches "expression" as an independent atom (the algorithm
- will not backtrack into it if a failure occurs later in the expression).
-
Alternatives
-
-
Alternatives occur when the expression can match either one sub-expression or
- another, each alternative is separated by a "|", or a "\|" if the flag
- regex_constants::bk_vbar is set, or by a newline character if the flag
- regex_constants::newline_alt is set. Each alternative is the largest possible
- previous sub-expression; this is the opposite behavior from repetition
- operators.
-
-
Examples:
-
-
"a(b|c)" could match "ab" or "ac".
-
-
"abc|def" could match "abc" or "def".
-
-
Sets
-
-
A set is a set of characters that can match any single character that is a
- member of the set. Sets are delimited by "[" and "]" and can contain literals,
- character ranges, character classes, collating elements and equivalence
- classes. Set declarations that start with "^" contain the complement of the
- elements that follow.
-
-
Examples:
-
-
Character literals:
-
-
"[abc]" will match either of "a", "b", or "c".
-
-
"[^abc]" will match any character other than "a", "b", or "c".
-
-
Character ranges:
-
-
"[a-z]" will match any character in the range "a" to "z".
-
-
"[^A-Z]" will match any character other than those in the range "A" to "Z".
-
-
Note that character ranges are highly locale dependent if the flag
- regex_constants::collate is set: they match any character that collates between
- the endpoints of the range, ranges will only behave according to ASCII rules
- when the default "C" locale is in effect. For example if the library is
- compiled with the Win32 localization model, then [a-z] will match the ASCII
- characters a-z, and also 'A', 'B' etc, but not 'Z' which collates just after
- 'z'. This locale specific behavior is disabled by default (in perl mode), and
- forces ranges to collate according to ASCII character code.
-
-
Character classes are denoted using the syntax "[:classname:]" within a set
- declaration, for example "[[:space:]]" is the set of all whitespace characters.
- Character classes are only available if the flag regex_constants::char_classes
- is set. The available character classes are:
-
-
-
-
-
-
-
-
alnum
-
Any alpha numeric character.
-
-
-
-
-
alpha
-
Any alphabetical character a-z and A-Z. Other
- characters may also be included depending upon the locale.
-
-
-
-
-
blank
-
Any blank character, either a space or a tab.
-
-
-
-
-
cntrl
-
Any control character.
-
-
-
-
-
digit
-
Any digit 0-9.
-
-
-
-
-
graph
-
Any graphical character.
-
-
-
-
-
lower
-
Any lower case character a-z. Other characters may
- also be included depending upon the locale.
-
-
-
-
-
print
-
Any printable character.
-
-
-
-
-
punct
-
Any punctuation character.
-
-
-
-
-
space
-
Any whitespace character.
-
-
-
-
-
upper
-
Any upper case character A-Z. Other characters may
- also be included depending upon the locale.
-
-
-
-
-
xdigit
-
Any hexadecimal digit character, 0-9, a-f and A-F.
-
-
-
-
-
word
-
Any word character - all alphanumeric characters plus
- the underscore.
-
-
-
-
-
Unicode
-
Any character whose code is greater than 255, this
- applies to the wide character traits classes only.
-
-
-
-
-
There are some shortcuts that can be used in place of the character classes,
- provided the flag regex_constants::escape_in_lists is set then you can use:
-
-
\w in place of [:word:]
-
-
\s in place of [:space:]
-
-
\d in place of [:digit:]
-
-
\l in place of [:lower:]
-
-
\u in place of [:upper:]
-
-
Collating elements take the general form [.tagname.] inside a set declaration,
- where tagname is either a single character, or a name of a collating
- element, for example [[.a.]] is equivalent to [a], and [[.comma.]] is
- equivalent to [,]. The library supports all the standard POSIX collating
- element names, and in addition the following digraphs: "ae", "ch", "ll", "ss",
- "nj", "dz", "lj", each in lower, upper and title case variations.
- Multi-character collating elements can result in the set matching more than one
- character, for example [[.ae.]] would match two characters, but note that
- [^[.ae.]] would only match one character.
-
-
- Equivalence classes take the general form [=tagname=] inside a set declaration,
- where tagname is either a single character, or a name of a collating
- element, and matches any character that is a member of the same primary
- equivalence class as the collating element [.tagname.]. An equivalence class is
- a set of characters that collate the same, a primary equivalence class is a set
- of characters whose primary sort key are all the same (for example strings are
- typically collated by character, then by accent, and then by case; the primary
- sort key then relates to the character, the secondary to the accentation, and
- the tertiary to the case). If there is no equivalence class corresponding to
- tagname, then [=tagname=] is exactly the same as [.tagname.]. Unfortunately there is no
- locale independent method of obtaining the primary sort key for a character,
- except under Win32. For other operating systems the library will "guess" the
- primary sort key from the full sort key (obtained from strxfrm), so
- equivalence classes are probably best considered broken under any operating
- system other than Win32.
-
-
To include a literal "-" in a set declaration then: make it the first character
- after the opening "[" or "[^", the endpoint of a range, a collating element, or
- if the flag regex_constants::escape_in_lists is set then precede with an escape
- character as in "[\-]". To include a literal "[" or "]" or "^" in a set then
- make them the endpoint of a range, a collating element, or precede with an
- escape character if the flag regex_constants::escape_in_lists is set.
-
-
Line anchors
-
-
An anchor is something that matches the null string at the start or end of a
- line: "^" matches the null string at the start of a line, "$" matches the null
- string at the end of a line.
-
-
Back references
-
-
A back reference is a reference to a previous sub-expression that has already
- been matched, the reference is to what the sub-expression matched, not to the
- expression itself. A back reference consists of the escape character "\"
- followed by a digit "1" to "9", "\1" refers to the first sub-expression, "\2"
- to the second etc. For example the expression "(.*)\1" matches any string that
- is repeated about its mid-point for example "abcabc" or "xyzxyz". A back
- reference to a sub-expression that did not participate in any match, matches
- the null string: NB this is different to some other regular expression
- matchers. Back references are only available if the expression is compiled with
- the flag regex_constants::bk_refs set.
-
-
Characters by code
-
-
This is an extension to the algorithm that is not available in other libraries,
- it consists of the escape character followed by the digit "0" followed by the
- octal character code. For example "\023" represents the character whose octal
- code is 23. Where ambiguity could occur use parentheses to break the expression
- up: "\0103" represents the character whose code is 103, "(\010)3" represents the
- character 10 followed by "3". To match characters by their hexadecimal code,
- use \x followed by a string of hexadecimal digits, optionally enclosed inside
- {}, for example \xf0 or \x{aff}, notice the latter example is a Unicode
- character.
-
Word operators
-
-
The following operators are provided for compatibility with the GNU regular
- expression library.
-
-
"\w" matches any single character that is a member of the "word" character
- class, this is identical to the expression "[[:word:]]".
-
-
"\W" matches any single character that is not a member of the "word" character
- class, this is identical to the expression "[^[:word:]]".
-
-
"\<" matches the null string at the start of a word.
-
-
"\>" matches the null string at the end of the word.
-
-
"\b" matches the null string at either the start or the end of a word.
-
-
"\B" matches a null string within a word.
-
-
The start of the sequence passed to the matching algorithms is considered to be
- a potential start of a word unless the flag match_not_bow is set. The end of
- the sequence passed to the matching algorithms is considered to be a potential
- end of a word unless the flag match_not_eow is set.
-
-
Buffer operators
-
-
The following operators are provided for compatibility with the GNU regular
- expression library, and Perl regular expressions:
-
-
"\`" matches the start of a buffer.
-
-
"\A" matches the start of the buffer.
-
-
"\'" matches the end of a buffer.
-
-
"\z" matches the end of a buffer.
-
-
"\Z" matches the end of a buffer, or possibly one or more new line characters
- followed by the end of the buffer.
-
-
A buffer is considered to consist of the whole sequence passed to the matching
- algorithms, unless the flags match_not_bob or match_not_eob are set.
-
-
Escape operator
-
-
The escape character "\" has several meanings.
-
-
Inside a set declaration the escape character is a normal character unless the
- flag regex_constants::escape_in_lists is set in which case whatever follows the
- escape is a literal character regardless of its normal meaning.
-
-
The escape operator may introduce an operator for example: back references, or
- a word operator.
-
-
The escape operator may make the following character normal, for example "\*"
- represents a literal "*" rather than the repeat operator.
-
-
Single character escape sequences
-
-
The following escape sequences are aliases for single characters:
-
-
-
-
-
-
-
-
Escape sequence
-
-
Character code
-
-
Meaning
-
-
-
-
-
-
\a
-
-
0x07
-
-
Bell character.
-
-
-
-
-
-
\f
-
-
0x0C
-
-
Form feed.
-
-
-
-
-
-
\n
-
-
0x0A
-
-
Newline character.
-
-
-
-
-
-
\r
-
-
0x0D
-
-
Carriage return.
-
-
-
-
-
-
\t
-
-
0x09
-
-
Tab character.
-
-
-
-
-
-
\v
-
-
0x0B
-
-
Vertical tab.
-
-
-
-
-
-
\e
-
-
0x1B
-
-
ASCII Escape character.
-
-
-
-
-
-
\0dd
-
-
0dd
-
-
An octal character code, where dd is one or
- more octal digits.
-
-
-
-
-
-
\xXX
-
-
0xXX
-
-
A hexadecimal character code, where XX is one or more
- hexadecimal digits.
-
-
-
-
-
-
\x{XX}
-
-
0xXX
-
-
A hexadecimal character code, where XX is one or more
- hexadecimal digits, optionally a Unicode character.
-
-
-
-
-
-
\cZ
-
-
z-@
-
-
An ASCII escape sequence control-Z, where Z is any
- ASCII character greater than or equal to the character code for '@'.
-
-
-
-
-
-
Miscellaneous escape sequences:
-
-
The following are provided mostly for perl compatibility, but note that there
- are some differences in the meanings of \l \L \u and \U:
-
-
-
-
-
-
-
-
\w
-
-
Equivalent to [[:word:]].
-
-
-
-
-
-
\W
-
-
Equivalent to [^[:word:]].
-
-
-
-
-
-
\s
-
-
Equivalent to [[:space:]].
-
-
-
-
-
-
\S
-
-
Equivalent to [^[:space:]].
-
-
-
-
-
-
\d
-
-
Equivalent to [[:digit:]].
-
-
-
-
-
-
\D
-
-
Equivalent to [^[:digit:]].
-
-
-
-
-
-
\l
-
-
Equivalent to [[:lower:]].
-
-
-
-
-
-
\L
-
-
Equivalent to [^[:lower:]].
-
-
-
-
-
-
\u
-
-
Equivalent to [[:upper:]].
-
-
-
-
-
-
\U
-
-
Equivalent to [^[:upper:]].
-
-
-
-
-
-
\C
-
-
Any single character, equivalent to '.'.
-
-
-
-
-
-
\X
-
-
Match any Unicode combining character sequence, for
- example "a\x 0301" (a letter a with an acute).
-
-
-
-
-
-
\Q
-
-
The begin quote operator, everything that follows is
- treated as a literal character until a \E end quote operator is found.
-
-
-
-
-
-
\E
-
-
The end quote operator, terminates a sequence begun
- with \Q.
-
-
-
-
-
-
What gets matched?
-
-
- When the expression is compiled as a Perl-compatible regex then the matching
- algorithms will perform a depth first search on the state machine and report
- the first match found.
-
- When the expression is compiled as a POSIX-compatible regex then the matching
- algorithms will match the first possible matching string, if more than one
- string starting at a given location can match then it matches the longest
- possible string, unless the flag match_any is set, in which case the first
- match encountered is returned. Use of the match_any option can reduce the time
- taken to find the match - but is only useful if the user is less concerned
- about what matched - for example it would not be suitable for search and
- replace operations. In cases where there are multiple possible matches all
- starting at the same location, and all of the same length, then the match
- chosen is the one with the longest first sub-expression, if that is the same
- for two or more matches, then the second sub-expression will be examined and so
- on.
-
-
- The following table examples illustrate the main differences between Perl and
- POSIX regular expression matching rules:
-
-
-
-
-
-
-
Expression
-
-
-
Text
-
-
-
POSIX leftmost longest match
-
-
-
ECMAScript depth first search match
-
-
-
-
-
a|ab
-
-
-
xaby
-
-
-
-
"ab"
-
-
-
"a"
-
-
-
-
-
.*([[:alnum:]]+).*
-
-
-
" abc def xyz "
-
-
-
$0 = " abc def xyz "
- $1 = "abc"
-
-
-
$0 = " abc def xyz "
- $1 = "z"
-
-
-
-
-
.*(a|xayy)
-
-
-
zzxayyzz
-
-
-
"zzxayy"
-
-
-
"zzxa"
-
-
-
-
These differences between Perl matching rules, and POSIX matching rules, mean
- that these two regular expression syntaxes differ not only in the features
- offered, but also in the form that the state machine takes and/or the
- algorithms used to traverse the state machine.
+ depend upon the flags used during
+ expression compilation.
+
+
There are three main syntax options available, depending upon how
+ you construct the regular expression object:
The POSIX-Basic regular expression syntax is used by the Unix utility sed,
+ and variations are used by grep and emacs. You can
+ construct POSIX basic regular expressions in Boost.Regex by passing the flag basic
+ to the regex constructor, for example:
+
// e1 is a case sensitive POSIX-Basic expression:
+boost::regex e1(my_expression, boost::regex::basic);
+// e2 is a case insensitive POSIX-Basic expression:
+boost::regex e2(my_expression, boost::regex::basic|boost::regex::icase);
+
POSIX Basic Syntax
+
In POSIX-Basic regular expressions, all characters match themselves except
+ for the following special characters:
+
.[\*^$
+
Wildcard:
+
The single character '.' when used outside of a character set will match any
+ single character except:
+
The NUL character when the flag match_not_dot_null is passed to the
+ matching algorithms.
+
The newline character when the flag match_not_dot_newline is passed to
+ the matching algorithms.
+
Anchors:
+
A '^' character shall match the start of a line when used as the first
+ character of an expression, or the first character of a sub-expression.
+
A '$' character shall match the end of a line when used as the last character
+ of an expression, or the last character of a sub-expression.
+
Marked sub-expressions:
+
A section beginning \( and ending \) acts as a marked sub-expression.
+ Whatever matched the sub-expression is split out in a separate field by the
+ matching algorithms. Marked sub-expressions can also be repeated, or referred to by a back-reference.
+
Repeats:
+
Any atom (a single character, a marked sub-expression, or a character class)
+ can be repeated with the * operator.
+
For example a* will match the letter a repeated zero or more times
+ (an atom repeated zero times matches an empty string), so the expression a*b
+ will match any of the following:
+
b
+ab
+aaaaaaaab
+
An atom can also be repeated with a bounded repeat:
+
a\{n\} Matches 'a' repeated exactly n times.
+
a\{n,\} Matches 'a' repeated n or more times.
+
a\{n,m\} Matches 'a' repeated between n and m times
+ inclusive.
+
For example:
+
^a\{2,3\}$
+
Will match either of:
+
aa
+aaa
+
But neither of:
+
a
+aaaa
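
The bounded-repeat example above can be checked with a short program. This is a minimal sketch that uses the standard library's std::regex with its basic flag as a stand-in for boost::regex (an assumption, made so that the sketch has no external dependency); the helper name matches_bounded_repeat is hypothetical. regex_search is used so that the ^ and $ anchors, rather than the algorithm, force the whole string to be matched.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is exactly two or three 'a' characters.
// Assumption: std::regex::basic accepts the same \{n,m\} bounds syntax as the
// POSIX-Basic syntax described in this section.
bool matches_bounded_repeat(const std::string& text)
{
    // The regular expression is ^a\{2,3\}$ ; backslashes are doubled because
    // this is a C++ string literal.
    static const std::regex e("^a\\{2,3\\}$", std::regex::basic);
    return std::regex_search(text, e);
}
```

With this helper, "aa" and "aaa" are accepted, while "a" and "aaaa" are rejected.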
+
It is an error to use a repeat operator, if the preceding construct can not be
+ repeated, for example:
+
a\(*\)
+
Will raise an error, as there is nothing for the * operator to be applied to.
+
Back references:
+
An escape character followed by a digit n, where n is in the
+ range 1-9, matches the same string that was matched by sub-expression n.
+ For example the expression:
+
^\(a*\)[^a]*\1$
+
Will match the string:
+
aaabbaaa
+
But not the string:
+
aaabba
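
A back reference can be exercised in the same way. The sketch below is illustrative only: it uses std::regex::basic from the standard library in place of boost::regex (an assumption), and the helper name same_run_at_both_ends is hypothetical. The middle of the expression is written as [^a]* so that it cannot absorb any of the letters that the back reference has to match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` starts and ends with the same run of
// 'a' characters, with no 'a' in between.
// Assumption: std::regex::basic supports \( \) groups and \1 in the same way
// as the POSIX-Basic syntax described in this section.
bool same_run_at_both_ends(const std::string& text)
{
    // The regular expression is ^\(a*\)[^a]*\1$
    static const std::regex e("^\\(a*\\)[^a]*\\1$", std::regex::basic);
    return std::regex_match(text, e);
}
```

Here "aaabbaaa" is accepted because \1 repeats the three letters captured by the first sub-expression, whereas "aaabba" is rejected because no choice of leading run is repeated at the end.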
+
Character sets:
+
A character set is a bracket-expression starting with [ and ending with ], it
+ defines a set of characters, and matches any single character that is a member
+ of that set.
+
A bracket expression may contain any combination of the following:
+
+
Single characters:
+
For example [abc], will match any of the characters 'a', 'b', or 'c'.
+
Character ranges:
+
For example [a-c] will match any single character in the range 'a' to
+ 'c'. By default, for POSIX-Basic regular expressions, a character x
+ is within the range y to z, if it collates within that
+ range; this results in locale specific behavior. This behavior can
+ be turned off by unsetting the collate
+ option flag - in which case whether a character appears within a range is
+ determined by comparing the code points of the characters only.
+
Negation:
+
If the bracket-expression begins with the ^ character, then it matches the
+ complement of the characters it contains, for example [^a-c] matches any
+ character that is not in the range a-c.
+
Character classes:
+
An expression of the form [[:name:]] matches the named character class "name",
+ for example [[:lower:]] matches any lower case character. See
+ character class names.
+
Collating Elements:
+
An expression of the form [[.col.]] matches the collating element col.
+ A collating element is any single character, or any sequence of characters that
+ collates as a single unit. Collating elements may also be used as the end
+ point of a range, for example: [[.ae.]-c] matches the character sequence "ae",
+ plus any single character in the range "ae"-c, assuming that "ae" is treated
+ as a single collating element in the current locale.
+
As an extension, a collating element may also be specified via its
+ symbolic name, for example:
+
[[.NUL.]]
+
matches a NUL character.
+
Equivalence classes:
+
+ An expression of the form [[=col=]], matches any character or collating element
+ whose primary sort key is the same as that for collating element col,
+ as with collating elements the name col may be a
+ symbolic name. A primary sort key is one that ignores case,
+ accentation, or locale-specific tailorings; so for example [[=a=]] matches any
+ of the characters: a, à, á, â, ã, ä, å, A, À, Á, Â, Ã, Ä and Å.
+ Unfortunately implementation of this is reliant on the platform's collation and
+ localisation support; this feature can not be relied upon to work portably
+ across all platforms, or even all locales on one platform.
+
+
Combinations:
+
All of the above can be combined in one character set declaration, for example:
+ [[:digit:]a-c[.NUL.]].
+
Escapes
+
With the exception of the escape sequences \{, \}, \(, and \), which are
+ documented above, an escape followed by any character matches that
+ character. This can be used to make the special characters .[\*^$,
+ "ordinary". Note that the escape character loses its special meaning
+ inside a character set, so [\^] will match either a literal '\' or a '^'.
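
The two escape rules above can be sketched as follows. The code uses std::regex::basic from the standard library as a stand-in for boost::regex (an assumption), and both helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: outside a character set, an escape makes the special
// character '.' ordinary, so only the literal text "a.b" is accepted.
bool is_literal_a_dot_b(const std::string& text)
{
    static const std::regex e("a\\.b", std::regex::basic);  // regex: a\.b
    return std::regex_match(text, e);
}

// Hypothetical helper: inside a character set the escape character is not
// special, so the set [\^] contains two members: '\' and '^'.
bool is_backslash_or_caret(const std::string& text)
{
    static const std::regex e("[\\^]", std::regex::basic);  // regex: [\^]
    return std::regex_match(text, e);
}
```

Under these assumptions is_literal_a_dot_b accepts "a.b" but not "axb", and is_backslash_or_caret accepts a single backslash or a single caret.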
+
Variations
+
Grep
+
When an expression is compiled with the flag grep set, then the
+ expression is treated as a newline separated list of POSIX-Basic
+ expressions, a match is found if any of the expressions in the list match, for
+ example:
+
boost::regex e("abc\ndef", boost::regex::grep);
+
will match either of the POSIX-Basic expressions "abc" or "def".
+
As its name suggests, this behavior is consistent with the Unix utility grep.
+
emacs
+
In addition to the POSIX-Basic features the following
+ characters are also special:
+
+
+ repeats the preceding atom one or more times.
+
? repeats the preceding atom zero or one times.
+
*? A non-greedy version of *.
+
+? A non-greedy version of +.
+
?? A non-greedy version of ?.
+
+
And the following escape sequences are also recognised:
+
+
\| specifies an alternative.
+
\(?: ... \) is a non-marking grouping construct - allows you to
+ lexically group something without spitting out an extra sub-expression.
+
\w matches any word character.
+
\W matches any non-word character.
+
\sx matches any character in the syntax group x, the following emacs
+ groupings are supported: 's', ' ', '_', 'w', '.', ')', '(', '"', '\'', '>'
+ and '<'. Refer to the emacs docs for details.
+
\Sx matches any character not in the syntax grouping x.
+
\c and \C are not supported.
+
\` matches zero characters only at the start of a buffer (or string being
+ matched).
+
\' matches zero characters only at the end of a buffer (or string being
+ matched).
+
\b matches zero characters at a word boundary.
+
\B matches zero characters, not at a word boundary.
+
\< matches zero characters only at the start of a word.
+
\> matches zero characters only at the end of a word.
The POSIX-Extended regular expression syntax is supported by the POSIX C
+ regular expression APIs, and variations are used by the utilities egrep
+ and awk. You can construct POSIX extended regular expressions in
+ Boost.Regex by passing the flag extended to the regex constructor, for
+ example:
+
// e1 is a case sensitive POSIX-Extended expression:
+boost::regex e1(my_expression, boost::regex::extended);
+// e2 is a case insensitive POSIX-Extended expression:
+boost::regex e2(my_expression, boost::regex::extended|boost::regex::icase);
+
POSIX Extended Syntax
+
In POSIX-Extended regular expressions, all characters match themselves except
+ for the following special characters:
+
.[{()\*+?|^$
+
Wildcard:
+
The single character '.' when used outside of a character set will match any
+ single character except:
+
The NUL character when the flag match_not_dot_null is passed to the
+ matching algorithms.
+
The newline character when the flag match_not_dot_newline is passed to
+ the matching algorithms.
+
Anchors:
+
A '^' character shall match the start of a line when used as the first
+ character of an expression, or the first character of a sub-expression.
+
A '$' character shall match the end of a line when used as the last character
+ of an expression, or the last character of a sub-expression.
+
Marked sub-expressions:
+
A section beginning ( and ending ) acts as a marked sub-expression.
+ Whatever matched the sub-expression is split out in a separate field by the
+ matching algorithms. Marked sub-expressions can also be repeated, or referred
+ to by a back-reference.
+
Repeats:
+
Any atom (a single character, a marked sub-expression, or a character class)
+ can be repeated with the *, +, ?, and {} operators.
+
The * operator will match the preceding atom zero or more times, for example
+ the expression a*b will match any of the following:
+
b
+ab
+aaaaaaaab
+
The + operator will match the preceding atom one or more times, for example
+ the expression a+b will match any of the following:
+
ab
+aaaaaaaab
+
But will not match:
+
b
+
The ? operator will match the preceding atom zero or one times, for
+ example the expression ca?b will match any of the following:
+
cb
+cab
+
But will not match:
+
caab
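
The behaviour of the *, + and ? operators listed above can be confirmed with a small helper. This is a sketch under stated assumptions: it uses std::regex::extended from the standard library rather than boost::regex, and the name matches_extended is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches the POSIX-Extended
// expression `pattern`.
// Assumption: std::regex::extended treats *, + and ? as described in this
// section.
bool matches_extended(const std::string& pattern, const std::string& text)
{
    const std::regex e(pattern, std::regex::extended);
    return std::regex_match(text, e);
}
```

For example, matches_extended("a+b", "ab") holds while matches_extended("a+b", "b") does not, and matches_extended("ca?b", "caab") fails because ? allows at most one repetition.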
+
An atom can also be repeated with a bounded repeat:
+
a{n} Matches 'a' repeated exactly n times.
+
a{n,} Matches 'a' repeated n or more times.
+
a{n,m} Matches 'a' repeated between n and m times
+ inclusive.
+
For example:
+
^a{2,3}$
+
Will match either of:
+
aa
+aaa
+
But neither of:
+
a
+aaaa
+
It is an error to use a repeat operator, if the preceding construct can not be
+ repeated, for example:
+
a(*)
+
Will raise an error, as there is nothing for the * operator to be applied to.
+
Back references:
+
An escape character followed by a digit n, where n is in the
+ range 1-9, matches the same string that was matched by sub-expression n.
+ For example the expression:
+
^(a*)[^a]*\1$
+
Will match the string:
+
aaabbaaa
+
But not the string:
+
aaabba
+
Caution: the POSIX standard does not support back-references
+ for "extended" regular expressions, this is a compatible extension to that
+ standard.
+
Alternation
+
The | operator will match either of its arguments, so for example: abc|def will
+ match either "abc" or "def".
+
+
Parentheses can be used to group alternations, for example: ab(d|ef) will match
+ either of "abd" or "abef".
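
A minimal sketch of alternation and grouping follows. It assumes that std::regex::extended from the standard library behaves like the POSIX-Extended syntax described here, and the helper name matches_alternation is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `text` matches the POSIX-Extended
// expression `pattern`.
bool matches_alternation(const std::string& pattern, const std::string& text)
{
    const std::regex e(pattern, std::regex::extended);
    return std::regex_match(text, e);
}
```

With this helper, "abc|def" accepts "abc" and "def" but not "abcdef", and "ab(d|ef)" accepts "abd" and "abef" but not "abe".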
+
Character sets:
+
A character set is a bracket-expression starting with [ and ending with ], it
+ defines a set of characters, and matches any single character that is a member
+ of that set.
+
A bracket expression may contain any combination of the following:
+
+
Single characters:
+
For example [abc], will match any of the characters 'a', 'b', or 'c'.
+
Character ranges:
+
For example [a-c] will match any single character in the range 'a' to
+ 'c'. By default, for POSIX-Extended regular expressions, a character x
+ is within the range y to z, if it collates within that
+ range; this results in locale specific behavior. This behavior can
+ be turned off by unsetting the collate
+ option flag - in which case whether a character appears within a range is
+ determined by comparing the code points of the characters only.
+
Negation:
+
If the bracket-expression begins with the ^ character, then it matches the
+ complement of the characters it contains, for example [^a-c] matches any
+ character that is not in the range a-c.
+
Character classes:
+
An expression of the form [[:name:]] matches the named character class "name",
+ for example [[:lower:]] matches any lower case character. See
+ character class names.
+
Collating Elements:
+
An expression of the form [[.col.]] matches the collating element col.
+ A collating element is any single character, or any sequence of characters that
+ collates as a single unit. Collating elements may also be used as the end
+ point of a range, for example: [[.ae.]-c] matches the character sequence "ae",
+ plus any single character in the range "ae"-c, assuming that "ae" is treated
+ as a single collating element in the current locale.
+
As an extension, a collating element may also be specified via its
+ symbolic name, for example:
+
[[.NUL.]]
+
matches a NUL character.
+
Equivalence classes:
+
+ An expression of the form [[=col=]], matches any character or collating element
+ whose primary sort key is the same as that for collating element col,
+ as with collating elements the name col may be a
+ symbolic name. A primary sort key is one that ignores case,
+ accentation, or locale-specific tailorings; so for example [[=a=]] matches any
+ of the characters: a, à, á, â, ã, ä, å, A, À, Á, Â, Ã, Ä and Å.
+ Unfortunately implementation of this is reliant on the platform's collation and
+ localisation support; this feature can not be relied upon to work portably
+ across all platforms, or even all locales on one platform.
+
+
Combinations:
+
All of the above can be combined in one character set declaration, for example:
+ [[:digit:]a-c[.NUL.]].
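
Character sets can be tried out with the sketch below. It relies on std::regex::extended from the standard library instead of boost::regex (an assumption), and the helper name in_set is hypothetical. Note that std::regex does not enable collate unless asked to, so the range a-c is compared by code point here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the single character `c` is matched by the
// bracket expression `set`, for example "[[:digit:]a-c]".
bool in_set(const std::string& set, char c)
{
    const std::regex e(set, std::regex::extended);
    return std::regex_match(std::string(1, c), e);
}
```

Under these assumptions in_set("[[:digit:]a-c]", '7') and in_set("[[:digit:]a-c]", 'b') hold, in_set("[[:digit:]a-c]", 'd') does not, and the negated set "[^a-c]" accepts 'd' but rejects 'a'.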
+
Operator precedence
+
The order of precedence of operators, from highest to lowest, is shown in the
+ following table:
+
Collation-related bracket symbols: [==] [::] [..]
+
Escaped characters: \
+
Character set (bracket expression): []
+
Grouping: ()
+
Single-character-ERE duplication: * + ? {m,n}
+
Concatenation
+
Anchoring: ^$
+
Alternation: |
+
+
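A few patterns make the practical effect of the precedence table visible. The sketch below is illustrative only: it uses std::regex with the extended flag as a stand-in for boost::regex, assuming both follow the POSIX precedence rules, and the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Whole-string match against a POSIX-Extended pattern.
bool full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern, std::regex::extended));
}

// Each check pairs a pattern with inputs that tell two possible parses apart.
bool precedence_examples_hold() {
    // Duplication binds tighter than concatenation: "ab*" is a(b*), not (ab)*.
    const bool duplication =
        full_match("ab*", "abbb") && !full_match("ab*", "abab");
    // Grouping binds tighter than duplication: "(ab)*" repeats the whole group.
    const bool grouping = full_match("(ab)*", "abab");
    // Concatenation binds tighter than alternation: "ab|cd" is (ab)|(cd),
    // so "abd" is rejected.
    const bool alternation = full_match("ab|cd", "ab") &&
                             full_match("ab|cd", "cd") &&
                             !full_match("ab|cd", "abd");
    // Parentheses are needed to alternate single characters inside a sequence.
    const bool grouped_alternation =
        full_match("a(b|c)d", "abd") && full_match("a(b|c)d", "acd");
    return duplication && grouping && alternation && grouped_alternation;
}
```

If the engine follows the table, precedence_examples_hold() returns true.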
+
Escapes
+
The POSIX standard defines no escape sequences for POSIX-Extended regular
+ expressions, except that:
+
+
+ Any special character preceded by an escape shall match itself.
+
+ The effect of any ordinary character being preceded by an escape is undefined.
+
+ An escape inside a character class declaration shall match itself (in other
+ words the escape character is not "special" inside a character class
+ declaration).
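The first and third rules can be checked with a short sketch. It uses std::regex with the extended flag as a stand-in for boost::regex, on the assumption that both apply the POSIX rules quoted above; the helper names are hypothetical. The second rule (escaping an ordinary character) is deliberately not exercised, because its effect is undefined.

```cpp
#include <regex>
#include <string>

// Whole-string match against a POSIX-Extended pattern.
bool matches_ere(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern, std::regex::extended));
}

// An escaped special character matches itself: "a\.b" accepts only "a.b".
bool escaped_dot_is_literal() {
    return matches_ere("a\\.b", "a.b") && !matches_ere("a\\.b", "axb");
}

// Inside a character class the escape is not special: the set "[\.]"
// contains two members, the backslash and the dot.
bool backslash_is_literal_in_set() {
    return matches_ere("[\\.]", "\\") && matches_ere("[\\.]", ".");
}
```

Both functions are expected to return true under the POSIX-Extended rules; the awk variation described later changes the second case.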
+
However, that's rather restrictive, so the following standard-compatible
+ extensions are also supported by Boost.Regex:
+
+
Escapes matching a specific character
+
The following escape sequences are all synonyms for single characters:
+
Escape      Character
\a          '\a'
\e          0x1B
\f          \f
\n          \n
\r          \r
\t          \t
\v          \v
\b          \b (but only inside a character class declaration).
\cX         An ASCII escape sequence - the character whose code point is X % 32
\xdd        A hexadecimal escape sequence - matches the single character whose code point is 0xdd.
\x{dddd}    A hexadecimal escape sequence - matches the single character whose code point is 0xdddd.
\0ddd       An octal escape sequence - matches the single character whose code point is 0ddd.
+
"Single character" character classes:
+
Any escaped character x, if x is the name of a character
+ class, shall match any character that is a member of that class, and any escaped
+ character X, if x is the name of a character class, shall
+ match any character not in that class.
+
The following are supported by default:
+
Escape sequence    Equivalent to
\d                 [[:digit:]]
\l                 [[:lower:]]
\s                 [[:space:]]
\u                 [[:upper:]]
\w                 [[:word:]]
\D                 [^[:digit:]]
\L                 [^[:lower:]]
\S                 [^[:space:]]
\U                 [^[:upper:]]
\W                 [^[:word:]]
+
Word Boundaries
+
The following escape sequences match the boundaries of words:
+
\<    Matches the start of a word.
\>    Matches the end of a word.
\b    Matches a word boundary (the start or end of a word).
\B    Matches only when not at a word boundary.
+
Buffer boundaries
+
The following match only at buffer boundaries: a "buffer" in this context is
+ the whole of the input text that is being matched against (note that ^ and
+ $ may match embedded newlines within the text).
+
\`    Matches at the start of a buffer only.
\'    Matches at the end of a buffer only.
\A    Matches at the start of a buffer only (the same as \`).
\z    Matches at the end of a buffer only (the same as \').
\Z    Matches an optional sequence of newlines at the end of a buffer: equivalent to the regular expression \n*\z
+
Continuation Escape
+
The sequence \G matches only at the end of the last match found, or at the
+ start of the text being matched if no previous match was found. This
+ escape is useful if you're iterating over the matches contained within a text, and
+ you want each subsequent match to start where the last one ended.
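The iteration pattern described here can be sketched without \G by driving the search loop by hand. This is not how the library implements \G: the sketch swaps in std::regex with the match_continuous match flag as a stand-in, which only accepts a match that starts exactly at the current position. The function name consecutive_matches is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects matches that follow one another with no gap, starting at the
// beginning of `text`. Stops at the first position where `re` does not match.
std::vector<std::string> consecutive_matches(const std::string& text,
                                             const std::regex& re) {
    std::vector<std::string> out;
    auto pos = text.cbegin();
    std::smatch m;
    while (pos != text.cend() &&
           std::regex_search(pos, text.cend(), m, re,
                             std::regex_constants::match_continuous)) {
        if (m.length(0) == 0) {
            break;  // guard against looping forever on an empty match
        }
        out.push_back(m.str(0));
        pos = m[0].second;  // the next match must start where this one ended
    }
    return out;
}
```

For the pattern [0-9]+, and the text "1,22,x,3," this collects "1," and "22," and then stops at the "x", because the next match would not start where the previous one ended.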
+
Quoting escape
+
The escape sequence \Q begins a "quoted sequence": all the subsequent
+ characters are treated as literals, until either the end of the regular
+ expression or \E is found. For example the expression: \Q\*+\Ea+ would
+ match either of:
+
\*+a \*+aaa
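For engines that lack \Q...\E, a similar effect can be approximated by escaping the literal text before it is spliced into a pattern. The sketch below is a different technique from the quoting escape itself: it uses std::regex with the extended flag, and both quote_literal and matches_quoted_prefix are hypothetical helpers, not part of any library. Only the characters that are special outside a bracket expression in POSIX-Extended syntax are escaped.

```cpp
#include <regex>
#include <string>

// Backslash-escapes every POSIX-Extended special character in `text`, so the
// result matches `text` literally when used as (part of) a pattern.
std::string quote_literal(const std::string& text) {
    static const std::string specials = ".[\\()*+?{|^$";
    std::string out;
    out.reserve(text.size() * 2);
    for (char c : text) {
        if (specials.find(c) != std::string::npos) {
            out += '\\';
        }
        out += c;
    }
    return out;
}

// Builds "<quoted literal><tail pattern>" and tests it against the whole text.
bool matches_quoted_prefix(const std::string& literal,
                           const std::string& tail_pattern,
                           const std::string& text) {
    const std::regex re(quote_literal(literal) + tail_pattern,
                        std::regex::extended);
    return std::regex_match(text, re);
}
```

With the literal \*+ and the tail pattern a+, this accepts the same two strings as the example above: \*+a and \*+aaa.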
+
Unicode escapes
+
\C    Matches a single code point: in Boost regex this has exactly the same effect as a "." operator.
\X    Matches a combining character sequence: that is any non-combining character followed by a sequence of zero or more combining characters.
+
Any other escape
+
Any other escape sequence matches the character that is escaped, for example \@
+ matches a literal '@'.
+
+
Variations
+
Egrep
+
When an expression is compiled with the flag egrep set, the
+ expression is treated as a newline-separated list of POSIX-Extended
+ expressions; a match is found if any of the expressions in the list match, for
+ example:
+
boost::regex e("abc\ndef", boost::regex::egrep);
+
will match either of the POSIX-Extended expressions "abc" or "def".
+
As its name suggests, this behavior is consistent with the Unix utility egrep,
+ and with grep when used with the -E option.
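The newline-as-alternation behavior can be checked with a short sketch. It uses std::regex with the egrep flag as a stand-in, assuming the standard library implements the same rule as boost::regex::egrep; the helper name matches_any_listed is hypothetical.

```cpp
#include <regex>
#include <string>

// Compiles `patterns` as a newline-separated list of POSIX-Extended
// expressions and tests whether the whole of `text` matches one of them.
bool matches_any_listed(const std::string& patterns, const std::string& text) {
    const std::regex re(patterns, std::regex::egrep);
    return std::regex_match(text, re);
}

// "abc\ndef" behaves like the alternation "abc|def".
bool egrep_list_example() {
    return matches_any_listed("abc\ndef", "abc") &&
           matches_any_listed("abc\ndef", "def") &&
           !matches_any_listed("abc\ndef", "abcdef");
}
```

The newline in the pattern separates the alternatives rather than being matched, so neither "abcdef" nor the literal text "abc\ndef" matches as a whole.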
+
awk
+
In addition to the POSIX-Extended features, the
+ escape character is special inside a character class declaration.
+
In addition, some escape sequences that are not defined as part of
+ POSIX-Extended specification are required to be supported - however Boost.Regex
+ supports these by default anyway.
+
Options
+
There are a variety of flags that
+ may be combined with the extended and egrep options when
+ constructing the regular expression; in particular, note that the
+ newline_alt option alters the syntax, while the
+ collate, nosubs and icase options modify how the case and locale
+ sensitivity are to be applied.
Normally Boost.Regex behaves as if the Perl m-modifier is on: so the
assertions ^ and $ match after and before embedded newlines respectively,
- setting this flags is eqivalent to prefixing the expression with (?-m).
+ setting this flag is equivalent to prefixing the expression with (?-m).
In addition some perl-style escape sequences are supported (actually the awk
syntax requires \a \b \t \v \f \n and \r to be recognised, but other
- escape sequences invoke undefined behaviour according to the POSIX standard).
+ escape sequences invoke undefined behavior according to the POSIX standard).
Specifies that the grammar recognized by the regular expression engine is the
- same as that used by POSIX basic regular expressions in IEEE Std 1003.1-2001,
- Portable Operating System Interface (POSIX ), Base Definitions and Headers,
- Section 9, Regular Expressions (FWD.1).
+ same as that used by POSIX basic regular
+ expressions in IEEE Std 1003.1-2001, Portable Operating System Interface
+ (POSIX ), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).
Specifies that the grammar recognized by the regular expression engine is the
- same as that used by POSIX utility grep in IEEE Std 1003.1-2001, Portable
- Operating System Interface (POSIX ), Shells and Utilities, Section 4,
- Utilities, grep (FWD.1).
+ same as that used by POSIX utility grep in
+ IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and
+ Utilities, Section 4, Utilities, grep (FWD.1).
That is to say, the same as POSIX basic syntax, but with the newline character
- acting as an alternation character in addition to "|".
+ acting as an alternation character; the expression is treated as a newline
+ separated list of alternatives.
+
+
emacs
+
No
+
Specifies that the grammar recognised is the superset of the POSIX-Basic
+ syntax used by the emacs program.
+
The following options may also be set when using POSIX basic regular
@@ -390,7 +397,10 @@ static const syntax_option_type collate;
collate
Yes
-
Specifies that character ranges of the form "[a-b]" should be locale sensitive.
+
Specifies that character ranges of the form "[a-b]" should be locale
+ sensitive. This bit is on by default for
+ POSIX-Basic regular expressions, but can be unset to force ranges to be
+ compared by code point only.
Specifies that the \n character has the same effect as the alternation
operator |. Allows newline separated lists to be used as a list of
- alternatives.
+ alternatives. This bit is already set if you use the grep option.