forked from boostorg/regex
Initial commit of quickbook conversion of docs.
[SVN r37942]
This commit is contained in:
422
doc/syntax_extended.qbk
Normal file
422
doc/syntax_extended.qbk
Normal file
@ -0,0 +1,422 @@
|
||||
|
||||
[section:basic_extended POSIX Extended Regular Expression Syntax]
|
||||
|
||||
[h3 Synopsis]
|
||||
|
||||
The POSIX-Extended regular expression syntax is supported by the POSIX
|
||||
C regular expression API's, and variations are used by the utilities
|
||||
`egrep` and `awk`. You can construct POSIX extended regular expressions in
|
||||
Boost.Regex by passing the flag `extended` to the regex constructor, for example:
|
||||
|
||||
// e1 is a case sensitive POSIX-Extended expression:
|
||||
boost::regex e1(my_expression, boost::regex::extended);
|
||||
// e2 a case insensitive POSIX-Extended expression:
|
||||
boost::regex e2(my_expression, boost::regex::extended|boost::regex::icase);
|
||||
|
||||
[#boost_regex.posix_extended_syntax][h3 POSIX Extended Syntax]
|
||||
|
||||
In POSIX-Extended regular expressions, all characters match themselves except for
|
||||
the following special characters:
|
||||
|
||||
[pre .\[{()\\\*+?|^$]
|
||||
|
||||
[h4 Wildcard:]
|
||||
|
||||
The single character '.' when used outside of a character set will match
|
||||
any single character except:
|
||||
|
||||
* The NULL character when the flag `match_no_dot_null` is passed to the
|
||||
matching algorithms.
|
||||
* The newline character when the flag `match_not_dot_newline` is passed
|
||||
to the matching algorithms.
|
||||
|
||||
[h4 Anchors:]
|
||||
|
||||
A '^' character shall match the start of a line when used as the first
|
||||
character of an expression, or the first character of a sub-expression.
|
||||
|
||||
A '$' character shall match the end of a line when used as the
|
||||
last character of an expression, or the last character of a sub-expression.
|
||||
|
||||
[h4 Marked sub-expressions:]
|
||||
|
||||
A section beginning `(` and ending `)` acts as a marked sub-expression.
|
||||
Whatever matched the sub-expression is split out in a separate field
|
||||
by the matching algorithms. Marked sub-expressions can also repeated,
|
||||
or referred to by a back-reference.
|
||||
|
||||
[h4 Repeats:]
|
||||
|
||||
Any atom (a single character, a marked sub-expression, or a character class)
|
||||
can be repeated with the `*`, `+`, `?`, and `{}` operators.
|
||||
|
||||
The `*` operator will match the preceding atom /zero or more times/, for
|
||||
example the expression `a*b` will match any of the following:
|
||||
|
||||
[pre
|
||||
b
|
||||
ab
|
||||
aaaaaaaab
|
||||
]
|
||||
|
||||
The `+` operator will match the preceding atom /one or more times/,
|
||||
for example the expression a+b will match any of the following:
|
||||
|
||||
[pre
|
||||
ab
|
||||
aaaaaaaab
|
||||
]
|
||||
|
||||
But will not match:
|
||||
|
||||
[pre
|
||||
b
|
||||
]
|
||||
|
||||
The `?` operator will match the preceding atom /zero or one times/, for
|
||||
example the expression `ca?b` will match any of the following:
|
||||
|
||||
[pre
|
||||
cb
|
||||
cab
|
||||
]
|
||||
But will not match:
|
||||
|
||||
[pre
|
||||
caab
|
||||
]
|
||||
|
||||
An atom can also be repeated with a bounded repeat:
|
||||
|
||||
`a{n}` Matches 'a' repeated /exactly n times/.
|
||||
|
||||
`a{n,}` Matches 'a' repeated /n or more times/.
|
||||
|
||||
`a{n, m}` Matches 'a' repeated /between n and m times inclusive/.
|
||||
|
||||
For example:
|
||||
|
||||
[pre ^a{2,3}\$]
|
||||
|
||||
Will match either of:
|
||||
|
||||
aa
|
||||
aaa
|
||||
|
||||
But neither of:
|
||||
|
||||
a
|
||||
aaaa
|
||||
|
||||
It is an error to use a repeat operator, if the preceding construct can not
|
||||
be repeated, for example:
|
||||
|
||||
a(*)
|
||||
|
||||
Will raise an error, as there is nothing for the `*` operator to be applied to.
|
||||
|
||||
[h4 Back references:]
|
||||
|
||||
An escape character followed by a digit /n/, where /n/ is in the range 1-9,
|
||||
matches the same string that was matched by sub-expression /n/. For example
|
||||
the expression:
|
||||
|
||||
[pre ^(a\*).\*\\1\$]
|
||||
|
||||
Will match the string:
|
||||
|
||||
aaabbaaa
|
||||
|
||||
But not the string:
|
||||
|
||||
aaabba
|
||||
|
||||
[caution The POSIX standard does not support back-references for "extended"
|
||||
regular expressions, this is a compatible extension to that standard.]
|
||||
|
||||
[h4 Alternation]
|
||||
|
||||
The `|` operator will match either of its arguments, so for example:
|
||||
`abc|def` will match either "abc" or "def".
|
||||
|
||||
Parenthesis can be used to group alternations, for example: `ab(d|ef)`
|
||||
will match either of "abd" or "abef".
|
||||
|
||||
[h4 Character sets:]
|
||||
|
||||
A character set is a bracket-expression starting with \[ and ending with \],
|
||||
it defines a set of characters, and matches any single character that is
|
||||
a member of that set.
|
||||
|
||||
A bracket expression may contain any combination of the following:
|
||||
|
||||
[h5 Single characters:]
|
||||
|
||||
For example `[abc]`, will match any of the characters 'a', 'b', or 'c'.
|
||||
|
||||
[h5 Character ranges:]
|
||||
|
||||
For example `[a-c]` will match any single character in the range 'a' to 'c'.
|
||||
By default, for POSIX-Extended regular expressions, a character /x/ is
|
||||
within the range /y/ to /z/, if it collates within that range; this
|
||||
results in locale specific behavior . This behavior can be turned
|
||||
off by unsetting the `collate`
|
||||
[link boost_regex.ref.syntax_option_type option flag] - in which case whether
|
||||
a character appears within a range is determined by comparing the code
|
||||
points of the characters only.
|
||||
|
||||
[h5 Negation:]
|
||||
|
||||
If the bracket-expression begins with the ^ character, then it matches the
|
||||
complement of the characters it contains, for example `[^a-c]` matches
|
||||
any character that is not in the range `a-c`.
|
||||
|
||||
[h5 Character classes:]
|
||||
|
||||
An expression of the form `[[:name:]]` matches the named character class "name",
|
||||
for example `[[:lower:]]` matches any lower case character.
|
||||
See [link boost_regex.syntax.character_classes character class names].
|
||||
|
||||
[h5 Collating Elements:]
|
||||
|
||||
An expression of the form `[[.col.]` matches the collating element /col/.
|
||||
A collating element is any single character, or any sequence of
|
||||
characters that collates as a single unit. Collating elements may
|
||||
also be used as the end point of a range, for example: `[[.ae.]-c]`
|
||||
matches the character sequence "ae", plus any single character
|
||||
in the range "ae"-c, assuming that "ae" is treated as a single
|
||||
collating element in the current locale.
|
||||
|
||||
Collating elements may be used in place of escapes (which are not
|
||||
normally allowed inside character sets), for example `[[.^.]abc]`
|
||||
would match either one of the characters 'abc^'.
|
||||
|
||||
As an extension, a collating element may also be specified via its
|
||||
[link boost_regex.syntax.collating_names symbolic name], for example:
|
||||
|
||||
[[.NUL.]]
|
||||
|
||||
matches a NUL character.
|
||||
|
||||
[h5 Equivalence classes:]
|
||||
|
||||
An expression of the form `[[=col=]]`, matches any character or collating element
|
||||
whose primary sort key is the same as that for collating element /col/,
|
||||
as with colating elements the name /col/ may be a
|
||||
[link boost_regex.syntax.collating_names symbolic name]. A primary
|
||||
sort key is one that ignores case, accentation, or locale-specific tailorings;
|
||||
so for example `[[=a=]]` matches any of the characters:
|
||||
a, '''À''', '''Á''', '''Â''',
|
||||
'''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''',
|
||||
'''â''', '''ã''', '''ä''' and '''å'''.
|
||||
Unfortunately implementation of this is reliant on the platform's
|
||||
collation and localisation support; this feature can not be relied
|
||||
upon to work portably across all platforms, or even all locales on one platform.
|
||||
|
||||
[h5 Combinations:]
|
||||
|
||||
All of the above can be combined in one character set declaration,
|
||||
for example: `[[:digit:]a-c[.NUL.]]`.
|
||||
|
||||
[h4 Escapes]
|
||||
|
||||
The POSIX standard defines no escape sequences for POSIX-Extended
|
||||
regular expressions, except that:
|
||||
|
||||
* Any special character preceded by an escape shall match itself.
|
||||
* The effect of any ordinary character being preceded by an escape is undefined.
|
||||
* An escape inside a character class declaration shall match itself: in
|
||||
other words the escape character is not "special" inside a character
|
||||
class declaration; so `[\^]` will match either a literal '\\' or a '^'.
|
||||
|
||||
However, that's rather restrictive, so the following standard-compatible
|
||||
extensions are also supported by Boost.Regex:
|
||||
|
||||
[h5 Escapes matching a specific character]
|
||||
|
||||
The following escape sequences are all synonyms for single characters:
|
||||
|
||||
[table
|
||||
[[Escape][Character]]
|
||||
[[\\a]['\\a']]
|
||||
[[\\e][0x1B]]
|
||||
[[\\f][\\f]]
|
||||
[[\\n][\\n]]
|
||||
[[\\r][\\r]]
|
||||
[[\\t][\\t]]
|
||||
[[\\v][\\v]]
|
||||
[[\\b][\\b (but only inside a character class declaration).]]
|
||||
[[\\cX][An ASCII escape sequence - the character whose code point is X % 32]]
|
||||
[[\\xdd][A hexadecimal escape sequence - matches the single character whose code point is 0xdd.]]
|
||||
[[\\x{dddd}][A hexadecimal escape sequence - matches the single character whose code point is 0xdddd.]]
|
||||
[[\\0ddd][An octal escape sequence - matches the single character whose code point is 0ddd.]]
|
||||
[[\\N{Name}][Matches the single character which has the symbolic name name. For example `\\N{newline}` matches the single character \\n.]]
|
||||
]
|
||||
|
||||
[h5 "Single character" character classes:]
|
||||
|
||||
Any escaped character /x/, if /x/ is the name of a character class shall
|
||||
match any character that is a member of that class, and any
|
||||
escaped character /X/, if /x/ is the name of a character class,
|
||||
shall match any character not in that class.
|
||||
|
||||
The following are supported by default:
|
||||
|
||||
[table
|
||||
[[Escape sequence][Equivalent to]]
|
||||
[[`\d`][`[[:digit:]]`]]
|
||||
[[`\l`][`[[:lower:]]`]]
|
||||
[[`\s`][`[[:space:]]`]]
|
||||
[[`\u`][`[[:upper:]]`]]
|
||||
[[`\w`][`[[:word:]]`]]
|
||||
[[`\D`][`[^[:digit:]]`]]
|
||||
[[`\L`][`[^[:lower:]]`]]
|
||||
[[`\S`][`[^[:space:]]`]]
|
||||
[[`\U`][`[^[:upper:]]`]]
|
||||
[[`\W`][`[^[:word:]]`]]
|
||||
]
|
||||
|
||||
[h5 Character Properties]
|
||||
|
||||
The character property names in the following table are all equivalent to the
|
||||
names used in character classes.
|
||||
|
||||
[table
|
||||
[[Form][Description][Equivalent character set form]]
|
||||
[[`\pX`][Matches any character that has the property X.][`[[:X:]]`]]
|
||||
[[`\p{Name}`][Matches any character that has the property Name.][`[[:Name:]]`]]
|
||||
[[`\PX`][Matches any character that does not have the property X.][`[^[:X:]]`]]
|
||||
[[`\P{Name}`][Matches any character that does not have the property Name.][`[^[:Name:]]`]]
|
||||
]
|
||||
|
||||
For example `\pd` matches any "digit" character, as does `\p{digit}`.
|
||||
|
||||
[h5 Word Boundaries]
|
||||
|
||||
The following escape sequences match the boundaries of words:
|
||||
|
||||
[table
|
||||
[[Escape][Meaning]]
|
||||
[[`\<`][Matches the start of a word.]]
|
||||
[[`\>`][Matches the end of a word.]]
|
||||
[[`\b`][Matches a word boundary (the start or end of a word).]]
|
||||
[[`\B`][Matches only when not at a word boundary.]]
|
||||
]
|
||||
|
||||
[h5 Buffer boundaries]
|
||||
|
||||
The following match only at buffer boundaries: a "buffer" in this
|
||||
context is the whole of the input text that is being matched against
|
||||
(note that ^ and $ may match embedded newlines within the text).
|
||||
|
||||
[table
|
||||
[[Escape][Meaning]]
|
||||
[[\\\`][Matches at the start of a buffer only.]]
|
||||
[[\\'][Matches at the end of a buffer only.]]
|
||||
[[`\A`][Matches at the start of a buffer only (the same as \\\`).]]
|
||||
[[`\z`][Matches at the end of a buffer only (the same as \\').]]
|
||||
[[`\Z`][Matches an optional sequence of newlines at the end of a buffer:
|
||||
equivalent to the regular expression `\n*\z`]]
|
||||
]
|
||||
|
||||
[h5 Continuation Escape]
|
||||
|
||||
The sequence `\G` matches only at the end of the last match found, or at
|
||||
the start of the text being matched if no previous match was found.
|
||||
This escape useful if you're iterating over the matches contained within
|
||||
a text, and you want each subsequence match to start where the last one ended.
|
||||
|
||||
[h5 Quoting escape]
|
||||
|
||||
The escape sequence `\Q` begins a "quoted sequence": all the subsequent
|
||||
characters are treated as literals, until either the end of the
|
||||
regular expression or `\E` is found. For example the expression: `\Q\*+\Ea+`
|
||||
would match either of:
|
||||
|
||||
\*+a
|
||||
\*+aaa
|
||||
|
||||
[h5 Unicode escapes]
|
||||
|
||||
[table
|
||||
[[Escape][Meaning]]
|
||||
[[`\C`][Matches a single code point: in Boost regex this has exactly the same effect as a "." operator.]]
|
||||
[[`\X`][Matches a combining character sequence: that is any non-combining character followed by a sequence of zero or more combining characters.]]
|
||||
]
|
||||
|
||||
[h5 Any other escape]
|
||||
|
||||
Any other escape sequence matches the character that is escaped,
|
||||
for example \\@ matches a literal '@'.
|
||||
|
||||
[h4 Operator precedence]
|
||||
|
||||
The order of precedence for of operators is as follows:
|
||||
|
||||
# Collation-related bracket symbols `[==] [::] [..]`
|
||||
# Escaped characters `\`
|
||||
# Character set (bracket expression) `[]`
|
||||
# Grouping `()`
|
||||
# Single-character-ERE duplication `* + ? {m,n}`
|
||||
# Concatenation
|
||||
# Anchoring ^$
|
||||
# Alternation `|`
|
||||
|
||||
[h4 What Gets Matched]
|
||||
|
||||
When there is more that one way to match a regular expression, the
|
||||
"best" possible match is obtained using the
|
||||
[link boost_regex.syntax.leftmost_longest_rule leftmost-longest rule].
|
||||
|
||||
[h3 Variations]
|
||||
|
||||
[h4 Egrep]
|
||||
|
||||
When an expression is compiled with the
|
||||
[link boost_regex.ref.syntax_option_type flag `egrep`] set, then the
|
||||
expression is treated as a newline separated list of
|
||||
[link boost_regex.posix_extended_syntax POSIX-Extended expressions],
|
||||
a match is found if any of the
|
||||
expressions in the list match, for example:
|
||||
|
||||
boost::regex e("abc\ndef", boost::regex::egrep);
|
||||
|
||||
will match either of the POSIX-Basic expressions "abc" or "def".
|
||||
|
||||
As its name suggests, this behavior is consistent with the Unix utility `egrep`,
|
||||
and with grep when used with the -E option.
|
||||
|
||||
[h4 awk]
|
||||
|
||||
In addition to the
|
||||
[link boost_regex.posix_extended_syntax POSIX-Extended features] the
|
||||
escape character is
|
||||
special inside a character class declaration.
|
||||
|
||||
In addition, some escape sequences that are not defined as part of
|
||||
POSIX-Extended specification are required to be supported - however Boost.Regex
|
||||
supports these by default anyway.
|
||||
|
||||
[h3 Options]
|
||||
|
||||
There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_extended variety of flags]
|
||||
that may be combined with the `extended` and `egrep` options when
|
||||
constructing the regular expression, in particular note that the
|
||||
[link boost_regex.ref.syntax_option_type.syntax_option_type_extended `newline_alt`]
|
||||
option alters the syntax, while the
|
||||
[link boost_regex.ref.syntax_option_type.syntax_option_type_extended `collate`, `nosubs`
|
||||
and `icase` options] modify how the case and locale sensitivity are to be applied.
|
||||
|
||||
[h3 References]
|
||||
|
||||
[@http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html
|
||||
IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Base Definitions and Headers, Section 9, Regular Expressions.]
|
||||
|
||||
[@http://www.opengroup.org/onlinepubs/000095399/utilities/grep.html
|
||||
IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, Utilities, egrep.]
|
||||
|
||||
[@http://www.opengroup.org/onlinepubs/000095399/utilities/awk.html
|
||||
IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX ), Shells and Utilities, Section 4, Utilities, awk.]
|
||||
|
||||
[endsect]
|
||||
|
Reference in New Issue
Block a user