forked from boostorg/regex
Initial commit of quickbook conversion of docs.
[SVN r37942]
This commit is contained in:
508
doc/syntax_perl.qbk
Normal file
@ -0,0 +1,508 @@
|
||||
|
||||
[section:perl_syntax Perl Regular Expression Syntax]
|
||||
|
||||
[h3 Synopsis]
|
||||
|
||||
The Perl regular expression syntax is based on that used by the
programming language Perl. Perl regular expressions are the
default behavior in Boost.Regex, or you can pass the flag `perl` explicitly to the
[basic_regex] constructor, for example:
|
||||
|
||||
   // e1 is a case sensitive Perl regular expression:
   // since Perl is the default option there's no need to explicitly specify the syntax used here:
   boost::regex e1(my_expression);
   // e2 is a case insensitive Perl regular expression:
   boost::regex e2(my_expression, boost::regex::perl|boost::regex::icase);
|
||||
|
||||
[h3 Perl Regular Expression Syntax]
|
||||
|
||||
In Perl regular expressions, all characters match themselves except for the
|
||||
following special characters:
|
||||
|
||||
[pre .\[{()\\\*+?|^$]
|
||||
|
||||
[h4 Wildcard]
|
||||
|
||||
The single character '.' when used outside of a character set will match
|
||||
any single character except:
|
||||
|
||||
* The NULL character when the [link boost_regex.ref.match_flag_type flag
  `match_not_dot_null`] is passed to the matching algorithms.
* The newline character when the [link boost_regex.ref.match_flag_type
  flag `match_not_dot_newline`] is passed to
  the matching algorithms.
|
||||
|
||||
[h4 Anchors]
|
||||
|
||||
A '^' character shall match the start of a line.
|
||||
|
||||
A '$' character shall match the end of a line.
|
||||
|
||||
[h4 Marked sub-expressions]
|
||||
|
||||
A section beginning `(` and ending `)` acts as a marked sub-expression.
Whatever matched the sub-expression is split out in a separate field by
the matching algorithms.  Marked sub-expressions can also be repeated, or
referred to by a back-reference.
|
||||
|
||||
[h4 Non-marking grouping]
|
||||
|
||||
A marked sub-expression is useful to lexically group part of a regular
expression, but has the side-effect of splitting out an extra field in
the result.  As an alternative you can lexically group part of a
regular expression, without generating a marked sub-expression, by using
`(?:` and `)`, for example `(?:ab)+` will repeat `ab` without splitting
out any separate sub-expressions.
|
||||
|
||||
[h4 Repeats]
|
||||
|
||||
Any atom (a single character, a marked sub-expression, or a character class)
|
||||
can be repeated with the `*`, `+`, `?`, and `{}` operators.
|
||||
|
||||
The `*` operator will match the preceding atom zero or more times,
|
||||
for example the expression `a*b` will match any of the following:
|
||||
|
||||
b
|
||||
ab
|
||||
aaaaaaaab
|
||||
|
||||
The `+` operator will match the preceding atom one or more times, for
|
||||
example the expression `a+b` will match any of the following:
|
||||
|
||||
ab
|
||||
aaaaaaaab
|
||||
|
||||
But will not match:
|
||||
|
||||
b
|
||||
|
||||
The `?` operator will match the preceding atom zero or one times, for
example the expression `ca?b` will match any of the following:
|
||||
|
||||
cb
|
||||
cab
|
||||
|
||||
But will not match:
|
||||
|
||||
caab
|
||||
|
||||
An atom can also be repeated with a bounded repeat:
|
||||
|
||||
`a{n}` Matches 'a' repeated exactly n times.
|
||||
|
||||
`a{n,}` Matches 'a' repeated n or more times.
|
||||
|
||||
`a{n, m}` Matches 'a' repeated between n and m times inclusive.
|
||||
|
||||
For example:
|
||||
|
||||
[pre ^a{2,3}$]
|
||||
|
||||
Will match either of:
|
||||
|
||||
aa
|
||||
aaa
|
||||
|
||||
But neither of:
|
||||
|
||||
a
|
||||
aaaa
|
||||
|
||||
It is an error to use a repeat operator, if the preceding construct can not
|
||||
be repeated, for example:
|
||||
|
||||
a(*)
|
||||
|
||||
Will raise an error, as there is nothing for the `*` operator to be applied to.
|
||||
|
||||
[h4 Non greedy repeats]
|
||||
|
||||
The normal repeat operators are "greedy", that is to say they will consume as
|
||||
much input as possible. There are non-greedy versions available that will
|
||||
consume as little input as possible while still producing a match.
|
||||
|
||||
`*?` Matches the previous atom zero or more times, while consuming as little
|
||||
input as possible.
|
||||
|
||||
`+?` Matches the previous atom one or more times, while consuming as
|
||||
little input as possible.
|
||||
|
||||
`??` Matches the previous atom zero or one times, while consuming
|
||||
as little input as possible.
|
||||
|
||||
`{n,}?` Matches the previous atom n or more times, while consuming as
|
||||
little input as possible.
|
||||
|
||||
`{n,m}?` Matches the previous atom between n and m times, while
|
||||
consuming as little input as possible.
|
||||
|
||||
[h4 Back references]
|
||||
|
||||
An escape character followed by a digit /n/, where /n/ is in the range 1-9,
|
||||
matches the same string that was matched by sub-expression /n/. For example
|
||||
the expression:
|
||||
|
||||
[pre ^(a\*)\[^a\]\*\\1$]
|
||||
|
||||
Will match the string:
|
||||
|
||||
aaabbaaa
|
||||
|
||||
But not the string:
|
||||
|
||||
aaabba
|
||||
|
||||
[h4 Alternation]
|
||||
|
||||
The `|` operator will match either of its arguments, so for example:
|
||||
`abc|def` will match either "abc" or "def".
|
||||
|
||||
Parentheses can be used to group alternations, for example: `ab(d|ef)`
will match either of "abd" or "abef".
|
||||
|
||||
Empty alternatives are not allowed (these are almost always a mistake), but
if you really want an empty alternative use `(?:)` as a placeholder, for example:

`|abc` is not a valid expression, but

`(?:)|abc` is valid and has the intended meaning; the expression

`(?:abc)??` has exactly the same effect.
|
||||
|
||||
[h4 Character sets]
|
||||
|
||||
A character set is a bracket-expression starting with `[` and ending with `]`;
it defines a set of characters, and matches any single character that is a
member of that set.
|
||||
|
||||
A bracket expression may contain any combination of the following:
|
||||
|
||||
[h5 Single characters]
|
||||
|
||||
For example `[abc]`, will match any of the characters 'a', 'b', or 'c'.
|
||||
|
||||
[h5 Character ranges]
|
||||
|
||||
For example `[a-c]` will match any single character in the range 'a' to 'c'.
|
||||
By default, for Perl regular expressions, a character /x/ is within the
range /y/ to /z/ if the code point of the character lies between the code points of
the endpoints of the range (inclusive).  Alternatively, if you set the
[link boost_regex.ref.syntax_option_type.syntax_option_type_perl `collate` flag]
when constructing the regular expression, then ranges are locale sensitive.
|
||||
|
||||
[h5 Negation]
|
||||
|
||||
If the bracket-expression begins with the ^ character, then it matches the
|
||||
complement of the characters it contains, for example `[^a-c]` matches
|
||||
any character that is not in the range `a-c`.
|
||||
|
||||
[h5 Character classes]
|
||||
|
||||
An expression of the form `[[:name:]]` matches the named character class
|
||||
"name", for example `[[:lower:]]` matches any lower case character.
|
||||
See [link boost_regex.syntax.character_classes character class names].
|
||||
|
||||
[h5 Collating Elements]
|
||||
|
||||
An expression of the form `[[.col.]]` matches the collating element /col/.
A collating element is any single character, or any sequence of characters
that collates as a single unit.  Collating elements may also be used
as the end point of a range, for example: `[[.ae.]-c]` matches the
character sequence "ae", plus any single character in the range "ae"-c,
assuming that "ae" is treated as a single collating element in the current locale.

As an extension, a collating element may also be specified via its
[link boost_regex.syntax.collating_names symbolic name], for example:
|
||||
|
||||
[[.NUL.]]
|
||||
|
||||
matches a `\0` character.
|
||||
|
||||
[h5 Equivalence classes]
|
||||
|
||||
An expression of the form `[[=col=]]`, matches any character or collating element
|
||||
whose primary sort key is the same as that for collating element /col/, as with
|
||||
collating elements the name /col/ may be a
|
||||
[link boost_regex.syntax.collating_names symbolic name]. A primary sort key is
|
||||
one that ignores case, accentuation, or locale-specific tailorings; so for
|
||||
example `[[=a=]]` matches any of the characters:
|
||||
a, '''À''', '''Á''', '''Â''',
|
||||
'''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''',
|
||||
'''â''', '''ã''', '''ä''' and '''å'''.
|
||||
Unfortunately implementation of this is reliant on the platform's collation
|
||||
and localisation support; this feature can not be relied upon to work portably
|
||||
across all platforms, or even all locales on one platform.
|
||||
|
||||
[h5 Escaped Characters]
|
||||
|
||||
All the escape sequences that match a single character, or a single character
|
||||
class are permitted within a character class definition. For example
|
||||
`[\[\]]` would match either of `[` or `]` while `[\W\d]` would match any character
|
||||
that is either a "digit", /or/ is /not/ a "word" character.
|
||||
|
||||
[h5 Combinations]
|
||||
|
||||
All of the above can be combined in one character set declaration, for example:
|
||||
`[[:digit:]a-c[.NUL.]]`.
|
||||
|
||||
[h4 Escapes]
|
||||
|
||||
Any special character preceded by an escape shall match itself.
|
||||
|
||||
The following escape sequences are all synonyms for single characters:
|
||||
|
||||
[table
|
||||
[[Escape][Character]]
|
||||
[[`\a`][`\a`]]
|
||||
[[`\e`][`0x1B`]]
|
||||
[[`\f`][`\f`]]
|
||||
[[`\n`][`\n`]]
|
||||
[[`\r`][`\r`]]
|
||||
[[`\t`][`\t`]]
|
||||
[[`\v`][`\v`]]
|
||||
[[`\b`][`\b` (but only inside a character class declaration).]]
|
||||
[[`\cX`][An ASCII escape sequence - the character whose code point is X % 32]]
|
||||
[[`\xdd`][A hexadecimal escape sequence - matches the single character whose
|
||||
code point is 0xdd.]]
|
||||
[[`\x{dddd}`][A hexadecimal escape sequence - matches the single character whose
|
||||
code point is 0xdddd.]]
|
||||
[[`\0ddd`][An octal escape sequence - matches the single character whose
|
||||
code point is 0ddd.]]
|
||||
[[`\N{name}`][Matches the single character which has the
|
||||
[link boost_regex.syntax.collating_names symbolic name] /name/.
|
||||
For example `\N{newline}` matches the single character \\n.]]
|
||||
]
|
||||
|
||||
[h5 "Single character" character classes:]
|
||||
|
||||
Any escaped character /x/, if /x/ is the name of a character class, shall
match any character that is a member of that class, and any
escaped character /X/, if /x/ is the name of a character class, shall
match any character not in that class.
|
||||
|
||||
The following are supported by default:
|
||||
|
||||
[table
|
||||
[[Escape sequence][Equivalent to]]
|
||||
[[`\d`][`[[:digit:]]`]]
|
||||
[[`\l`][`[[:lower:]]`]]
|
||||
[[`\s`][`[[:space:]]`]]
|
||||
[[`\u`][`[[:upper:]]`]]
|
||||
[[`\w`][`[[:word:]]`]]
|
||||
[[`\D`][`[^[:digit:]]`]]
|
||||
[[`\L`][`[^[:lower:]]`]]
|
||||
[[`\S`][`[^[:space:]]`]]
|
||||
[[`\U`][`[^[:upper:]]`]]
|
||||
[[`\W`][`[^[:word:]]`]]
|
||||
]
|
||||
|
||||
[h5 Character Properties]
|
||||
|
||||
The character property names in the following table are all equivalent
|
||||
to the [link boost_regex.syntax.character_classes names used in character classes].
|
||||
|
||||
[table
|
||||
[[Form][Description][Equivalent character set form]]
|
||||
[[`\pX`][Matches any character that has the property X.][`[[:X:]]`]]
|
||||
[[`\p{Name}`][Matches any character that has the property Name.][`[[:Name:]]`]]
|
||||
[[`\PX`][Matches any character that does not have the property X.][`[^[:X:]]`]]
|
||||
[[`\P{Name}`][Matches any character that does not have the property Name.][`[^[:Name:]]`]]
|
||||
]
|
||||
|
||||
For example `\pd` matches any "digit" character, as does `\p{digit}`.
|
||||
|
||||
[h5 Word Boundaries]
|
||||
|
||||
The following escape sequences match the boundaries of words:
|
||||
|
||||
`\<` Matches the start of a word.
|
||||
|
||||
`\>` Matches the end of a word.
|
||||
|
||||
`\b` Matches a word boundary (the start or end of a word).
|
||||
|
||||
`\B` Matches only when not at a word boundary.
|
||||
|
||||
[h5 Buffer boundaries]
|
||||
|
||||
The following match only at buffer boundaries: a "buffer" in this
|
||||
context is the whole of the input text that is being matched against
|
||||
(note that ^ and $ may match embedded newlines within the text).
|
||||
|
||||
\\\` Matches at the start of a buffer only.
|
||||
|
||||
\\' Matches at the end of a buffer only.
|
||||
|
||||
\\A Matches at the start of a buffer only (the same as \\\`).
|
||||
|
||||
\\z Matches at the end of a buffer only (the same as \\').
|
||||
|
||||
\\Z Matches an optional sequence of newlines at the end of a buffer:
|
||||
equivalent to the regular expression `\n*\z`
|
||||
|
||||
[h5 Continuation Escape]
|
||||
|
||||
The sequence `\G` matches only at the end of the last match found, or at
|
||||
the start of the text being matched if no previous match was found.
|
||||
This escape is useful if you're iterating over the matches contained within a
text, and you want each subsequent match to start where the last one ended.
|
||||
|
||||
[h5 Quoting escape]
|
||||
|
||||
The escape sequence `\Q` begins a "quoted sequence": all the subsequent characters
|
||||
are treated as literals, until either the end of the regular expression or \\E
|
||||
is found. For example the expression: `\Q\*+\Ea+` would match either of:
|
||||
|
||||
\*+a
|
||||
\*+aaa
|
||||
|
||||
[h5 Unicode escapes]
|
||||
|
||||
`\C` Matches a single code point: in Boost regex this has exactly the
|
||||
same effect as a "." operator.
|
||||
`\X` Matches a combining character sequence: that is any non-combining
|
||||
character followed by a sequence of zero or more combining characters.
|
||||
|
||||
[h5 Any other escape]
|
||||
|
||||
Any other escape sequence matches the character that is escaped, for example
|
||||
\\@ matches a literal '@'.
|
||||
|
||||
[h4 Perl Extended Patterns]
|
||||
|
||||
Perl-specific extensions to the regular expression syntax all start with `(?`.
|
||||
|
||||
[h5 Comments]
|
||||
|
||||
`(?# ... )` is treated as a comment; its contents are ignored.
|
||||
|
||||
[h5 Modifiers]
|
||||
|
||||
`(?imsx-imsx ... )` alters which of the Perl modifiers are in effect within
the pattern; changes take effect from the point that the block is first seen
and extend to any enclosing `)`.  Letters before a '-' turn that Perl
modifier on, letters afterward turn it off.
|
||||
|
||||
`(?imsx-imsx:pattern)` applies the specified modifiers to pattern only.
|
||||
|
||||
[h5 Non-marking groups]
|
||||
|
||||
`(?:pattern)` lexically groups pattern, without generating an additional
|
||||
sub-expression.
|
||||
|
||||
[h5 Lookahead]
|
||||
|
||||
`(?=pattern)` consumes zero characters, only if pattern matches.
|
||||
|
||||
`(?!pattern)` consumes zero characters, only if pattern does not match.
|
||||
|
||||
Lookahead is typically used to create the logical AND of two regular
|
||||
expressions, for example if a password must contain a lower case letter,
|
||||
an upper case letter, a punctuation symbol, and be at least 6 characters long,
|
||||
then the expression:
|
||||
|
||||
(?=.*[[:lower:]])(?=.*[[:upper:]])(?=.*[[:punct:]]).{6,}
|
||||
|
||||
could be used to validate the password.
|
||||
|
||||
[h5 Lookbehind]
|
||||
|
||||
`(?<=pattern)` consumes zero characters, only if pattern could be matched
|
||||
against the characters preceding the current position (pattern must be
|
||||
of fixed length).
|
||||
|
||||
`(?<!pattern)` consumes zero characters, only if pattern could not be
|
||||
matched against the characters preceding the current position (pattern must
|
||||
be of fixed length).
|
||||
|
||||
[h5 Independent sub-expressions]
|
||||
|
||||
`(?>pattern)` /pattern/ is matched independently of the surrounding patterns,
|
||||
the expression will never backtrack into /pattern/. Independent sub-expressions
|
||||
are typically used to improve performance; only the best possible match
|
||||
for pattern will be considered, if this doesn't allow the expression as a
|
||||
whole to match then no match is found at all.
|
||||
|
||||
[h5 Conditional Expressions]
|
||||
|
||||
`(?(condition)yes-pattern|no-pattern)` attempts to match /yes-pattern/ if
|
||||
the /condition/ is true, otherwise attempts to match /no-pattern/.
|
||||
|
||||
`(?(condition)yes-pattern)` attempts to match /yes-pattern/ if the /condition/
|
||||
is true, otherwise fails.
|
||||
|
||||
/condition/ may be either a forward lookahead assert, or the index of
|
||||
a marked sub-expression (the condition becomes true if the sub-expression
|
||||
has been matched).
|
||||
|
||||
[h4 Operator precedence]
|
||||
|
||||
The order of precedence of operators is as follows:
|
||||
|
||||
# Collation-related bracket symbols `[==] [::] [..]`
|
||||
# Escaped characters `\`
|
||||
# Character set (bracket expression) `[]`
|
||||
# Grouping `()`
|
||||
# Single-character-ERE duplication `* + ? {m,n}`
|
||||
# Concatenation
|
||||
# Anchoring ^$
|
||||
# Alternation |
|
||||
|
||||
[h3 What gets matched]
|
||||
|
||||
If you view the regular expression as a directed (possibly cyclic)
|
||||
graph, then the best match found is the first match found by a
|
||||
depth-first-search performed on that graph, while matching the input text.
|
||||
|
||||
Alternatively:
|
||||
|
||||
The best match found is the
|
||||
[link boost_regex.syntax.leftmost_longest_rule leftmost match],
|
||||
with individual elements matched as follows:
|
||||
|
||||
[table
|
||||
[[Construct][What gets matched]]
|
||||
[[`AtomA AtomB`][Locates the best match for /AtomA/ that has a following match for /AtomB/.]]
|
||||
[[`Expression1 | Expression2`][If /Expression1/ can be matched then returns that match,
   otherwise attempts to match /Expression2/.]]
|
||||
[[`S{N}`][Matches /S/ repeated exactly N times.]]
|
||||
[[`S{N,M}`][Matches S repeated between N and M times, and as many times as possible.]]
|
||||
[[`S{N,M}?`][Matches S repeated between N and M times, and as few times as possible.]]
|
||||
[[`S?, S*, S+`][The same as `S{0,1}`, `S{0,UINT_MAX}`, `S{1,UINT_MAX}` respectively.]]
|
||||
[[`S??, S*?, S+?`][The same as `S{0,1}?`, `S{0,UINT_MAX}?`, `S{1,UINT_MAX}?` respectively.]]
|
||||
[[`(?>S)`][Matches the best match for /S/, and only that.]]
|
||||
[[`(?=S), (?<=S)`][Matches only the best match for /S/ (this is only
   visible if there are capturing parentheses within /S/).]]
|
||||
[[`(?!S), (?<!S)`][Considers only whether a match for S exists or not.]]
|
||||
[[`(?(condition)yes-pattern | no-pattern)`][If condition is true, then
|
||||
only yes-pattern is considered, otherwise only no-pattern is considered.]]
|
||||
]
|
||||
|
||||
[h3 Variations]
|
||||
|
||||
The [link boost_regex.ref.syntax_option_type.syntax_option_type_perl options `normal`,
|
||||
`ECMAScript`, `JavaScript` and `JScript`] are all synonyms for
|
||||
`perl`.
|
||||
|
||||
[h3 Options]
|
||||
|
||||
There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_perl
|
||||
variety of flags] that may be combined with the `perl` option when
|
||||
constructing the regular expression, in particular note that the
|
||||
`newline_alt` option alters the syntax, while the `collate`, `nosubs` and
|
||||
`icase` options modify how the case and locale sensitivity are to be applied.
|
||||
|
||||
[h3 Pattern Modifiers]
|
||||
|
||||
The perl `smix` modifiers can either be applied using a `(?smix-smix)`
|
||||
prefix to the regular expression, or with one of the
|
||||
[link boost_regex.ref.syntax_option_type.syntax_option_type_perl regex-compile time
|
||||
flags `no_mod_m`, `mod_x`, `mod_s`, and `no_mod_s`].
|
||||
|
||||
[h3 References]
|
||||
|
||||
[@http://perldoc.perl.org/perlre.html Perl 5.8].
|
||||
|
||||
|
||||
[endsect]
|
||||
|