<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="Template"
content="C:\PROGRAM FILES\MICROSOFT OFFICE\OFFICE\html.dot">
<meta name="GENERATOR" content="Microsoft FrontPage Express 2.0">
<title>Regex++, Regular Expression Syntax</title>
</head>
<body bgcolor="#FFFFFF" link="#0000FF" vlink="#800080">
<p> </p>
<table border="0" cellpadding="7" cellspacing="0" width="100%">
<tr>
<td valign="top"><h3><img src="../../c++boost.gif"
alt="C++ Boost" width="276" height="86"></h3>
</td>
<td valign="top"><h3 align="center">Regex++, Regular
Expression Syntax.</h3>
<p align="left"><i>(Version 3.20, 29th Sept 2001)</i>
</p>
<p align="left"><i>Copyright (c) 1998-2001 </i></p>
<p align="left"><i>Dr John Maddock</i></p>
<p align="left"><i>Permission to use, copy, modify,
distribute and sell this software and its documentation
for any purpose is hereby granted without fee, provided
that the above copyright notice appear in all copies and
that both that copyright notice and this permission
notice appear in supporting documentation. Dr John
Maddock makes no representations about the suitability of
this software for any purpose. It is provided "as is"
without express or implied warranty.</i></p>
</td>
</tr>
</table>
<hr>
<h3><a name="syntax"></a><i>Regular expression syntax</i></h3>
<p>This section covers the regular expression syntax used by this
library. This is a programmer's guide: the actual syntax presented
to your program's users will depend upon the flags used during
expression compilation. </p>
<p><i>Literals</i> </p>
<p>All characters are literals except: ".",
"*", "?", "+", "(",
")", "{", "}", "[",
"]", "^" and "$". These characters
are literals when preceded by a "\". A literal is a
character that matches itself, or matches the result of
traits_type::translate(), where traits_type is the traits
template parameter to class reg_expression. <br>
<br>
</p>
<p><i>Wildcard</i> </p>
<p>The dot character "." matches any single character,
with two exceptions: when <i>match_not_dot_null</i> is passed to the matching
algorithms, the dot does not match a null character; and when <i>match_not_dot_newline</i>
is passed to the matching algorithms, the dot does not match
a newline character. <br>
<br>
</p>
<p><i>Repeats</i> </p>
<p>A repeat is an expression that is repeated an arbitrary number
of times. An expression followed by "*" can be repeated
any number of times, including zero. An expression followed by
"+" can be repeated any number of times, but at least
once; if the expression is compiled with the flag regbase::bk_plus_qm
then "+" is an ordinary character and "\+"
represents a repeat of once or more. An expression followed by
"?" may be repeated zero or one times only; if the
expression is compiled with the flag regbase::bk_plus_qm then
"?" is an ordinary character and "\?"
represents the repeat zero or once operator. </p>
<p>When it is necessary
to specify the minimum and maximum number of repeats explicitly,
the bounds operator "{}" may be used: "a{2}"
is the letter "a" repeated exactly twice, "a{2,4}"
represents the letter "a" repeated between 2 and 4
times, and "a{2,}" represents the letter "a"
repeated at least twice with no upper limit. Note that there must
be no white-space inside the {}, and there is no upper limit on
the values of the lower and upper bounds. When the expression is
compiled with the flag regbase::bk_braces then "{" and
"}" are ordinary characters and "\{" and
"\}" are used to delimit bounds instead. </p>
<p>All repeat
expressions refer to the shortest possible previous sub-expression:
a single character, a character set, or a sub-expression grouped
with "()", for example. </p>
<p>Examples: </p>
<p>"ba*" will match all of "b", "ba",
"baaa" etc. </p>
<p>"ba+" will match "ba" or "baaaa"
for example but not "b". </p>
<p>"ba?" will match "b" or "ba". </p>
<p>"ba{2,4}" will match "baa", "baaa"
and "baaaa". </p>
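<p>The repeat examples above can be checked with a few lines of code.
The sketch below is an illustration only: it uses the C++ standard
library's std::regex (default grammar) as a stand-in rather than
Regex++ itself, and the helper name matches_whole is hypothetical. For
these simple expressions the repeat operators behave as described
above. </p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `pattern` matches the whole of `text`.
// std::regex is used as a stand-in for reg_expression (an assumption).
bool matches_whole(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    return std::regex_match(text, re);
}
```

<p>For instance, matches_whole("ba{2,4}", "baaa") is true, while
matches_whole("ba+", "b") and matches_whole("ba{2,4}", "ba") are
both false. </p>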
<p><i>Non-greedy repeats</i> </p>
<p>Whenever the "extended" regular expression syntax is
in use (the default), non-greedy repeats are possible by
appending a '?' after the repeat; a non-greedy repeat is one
which will match the <i>shortest</i> possible string. </p>
<p>For example, to match HTML tag pairs one could use something
like: </p>
<p>"&lt;\s*tagname[^&gt;]*&gt;(.*?)&lt;\s*/tagname\s*&gt;"
</p>
<p>In this case $1 will contain the text between the tag pairs,
and will be the shortest possible matching string. <br>
<br>
</p>
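<p>As a rough illustration of the tag-pair expression, the following
sketch substitutes "b" for <i>tagname</i> and returns what $1 captured.
It relies on std::regex from the C++ standard library as a stand-in
for Regex++, and first_bold_text is a hypothetical name chosen for
this example. </p>

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text captured by $1 for the first
// <b>...</b> pair in `html`, or an empty string when nothing matches.
// std::regex stands in for Regex++ here (an assumption).
std::string first_bold_text(const std::string& html)
{
    // Same shape as the expression above, with "b" as the tag name.
    static const std::regex re("<\\s*b[^>]*>(.*?)<\\s*/b\\s*>");
    std::smatch what;
    if (std::regex_search(html, what, re))
        return what[1].str();
    return std::string();
}
```

<p>Given the input "&lt;b&gt;one&lt;/b&gt; and &lt;b&gt;two&lt;/b&gt;"
the helper returns "one"; with a greedy "(.*)" in place of "(.*?)" the
capture would instead run on to the last closing tag. </p>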
<p><i>Parentheses</i> </p>
<p>Parentheses serve two purposes: to group items together into a
sub-expression, and to mark what generated the match. For example
the expression "(ab)*" would match all of the string
"ababab". The matching algorithms <a
href="template_class_ref.htm#query_match">regex_match</a> and <a
href="template_class_ref.htm#reg_search">regex_search</a> each
take an instance of <a
href="template_class_ref.htm#match_results">match_results</a>
that reports what caused the match. On exit from these functions
the <a href="template_class_ref.htm#match_results">match_results</a>
contains information both on what the whole expression matched
and on what each sub-expression matched. In the example above
match_results[1] would contain a pair of iterators denoting the
final "ab" of the matching string. It is permissible
for sub-expressions to match null strings. If a sub-expression
takes no part in a match - for example if it is part of an
alternative that is not taken - then both of the iterators that
are returned for that sub-expression point to the end of the
input string, and the <i>matched</i> parameter for that sub-expression
is <i>false</i>. Sub-expressions are indexed from left to right
starting from 1; sub-expression 0 is the whole expression. </p>
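<p>The behaviour of marked sub-expressions can be sketched as follows.
This is not Regex++ code: it uses std::regex and std::smatch from the
C++ standard library as stand-ins for reg_expression and
match_results, and the function names are hypothetical. </p>

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Offset of the text last captured by sub-expression 1 of "(ab)*",
// or -1 if the whole of `text` does not match.
// std::regex stands in for Regex++ (an assumption).
long last_group_offset(const std::string& text)
{
    static const std::regex re("(ab)*");
    std::smatch what;
    if (!std::regex_match(text, what, re))
        return -1;
    return static_cast<long>(what.position(1));
}

// True when sub-expression `index` of "(a)|(b)" took part in
// matching the whole of `text`.
bool group_participated(const std::string& text, std::size_t index)
{
    static const std::regex re("(a)|(b)");
    std::smatch what;
    return std::regex_match(text, what, re) && what[index].matched;
}
```

<p>For "ababab" the first helper reports offset 4, the start of the
final "ab". For the input "b", sub-expression 2 of "(a)|(b)" takes
part in the match and sub-expression 1 does not. </p>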
<p><i>Non-Marking Parentheses</i> </p>
<p>Sometimes you need to group sub-expressions with parentheses,
but don't want the parentheses to generate another marked sub-expression.
In this case non-marking parentheses (?:expression) can be used.
For example the following expression creates no sub-expressions: </p>
<p>"(?:abc)*" <br>
<br>
</p>
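<p>One way to see the difference is to count the marked
sub-expressions in a compiled expression. The sketch below assumes
std::regex from the C++ standard library (not Regex++) and uses its
mark_count() member; marked_count is a hypothetical wrapper name. </p>

```cpp
#include <regex>
#include <string>

// Number of marked sub-expressions in `pattern`, as reported by
// std::regex::mark_count(). std::regex is a stand-in for Regex++.
unsigned marked_count(const std::string& pattern)
{
    const std::regex re(pattern);
    return re.mark_count();
}
```

<p>Here marked_count("(?:abc)*") is 0, whereas marked_count("(abc)*")
is 1. </p>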
<p><i>Alternatives</i> </p>
<p>Alternatives occur when the expression can match either one
sub-expression or another. Each alternative is separated by a
"|", or a "\|" if the flag regbase::bk_vbar
is set, or by a newline character if the flag regbase::newline_alt
is set. Each alternative is the largest possible previous sub-expression;
this is the opposite behaviour from repetition operators. </p>
<p>Examples: </p>
<p>"a(b|c)" could match "ab" or "ac".
</p>
<p>"abc|def" could match "abc" or "def".
<br>
<br>
</p>
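<p>The two examples can be tried out with the sketch below, which
again assumes std::regex as a stand-in for Regex++; both function
names are hypothetical. The second helper also shows that each
alternative extends as far as possible: "abc|def" is not the same as
"ab(c|d)ef". </p>

```cpp
#include <regex>
#include <string>

// Returns what sub-expression 1 of "a(b|c)" matched, or an empty
// string when the whole of `text` does not match.
std::string chosen_branch(const std::string& text)
{
    static const std::regex re("a(b|c)");
    std::smatch what;
    if (std::regex_match(text, what, re))
        return what[1].str();
    return std::string();
}

// True when the whole of `text` matches "abc|def".
bool is_abc_or_def(const std::string& text)
{
    static const std::regex re("abc|def");
    return std::regex_match(text, re);
}
```

<p>chosen_branch("ac") yields "c", and is_abc_or_def("abdef") is false
because the alternatives are "abc" and "def" as a whole. </p>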
<p><i>Sets</i> </p>
<p>A set is a set of characters that can match any single
character that is a member of the set. Sets are delimited by
"[" and "]" and can contain literals,
character ranges, character classes, collating elements and
equivalence classes. Set declarations that start with "^"
contain the complement of the elements that follow. </p>
<p>Examples: </p>
<p>Character literals: </p>
<p>"[abc]" will match any of "a", "b",
or "c". </p>
<p>"[^abc]" will match any character other than "a",
"b", or "c". </p>
<p>Character ranges: </p>
<p>"[a-z]" will match any character in the range "a"
to "z". </p>
<p>"[^A-Z]" will match any character other than those
in the range "A" to "Z". </p>
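<p>A small sketch of the set examples above follows. It assumes
std::regex from the C++ standard library rather than Regex++; without
its collate option std::regex compares range endpoints by character
code, so the result corresponds to the plain ASCII interpretation of
the ranges. The helper name count_matching_chars is hypothetical. </p>

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Counts the characters of `text` that are matched by the
// one-character set expression `set_expr`, for example "[a-z]".
// std::regex is a stand-in for Regex++ (an assumption).
std::size_t count_matching_chars(const std::string& set_expr,
                                 const std::string& text)
{
    const std::regex re(set_expr);
    std::size_t count = 0;
    for (char c : text) {
        // Test each character on its own against the set.
        if (std::regex_match(std::string(1, c), re))
            ++count;
    }
    return count;
}
```

<p>For the text "Hello", both "[a-z]" and "[^A-Z]" select the four
lower case letters. </p>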
<p>Note that character ranges are highly locale dependent: they
match any character that collates between the endpoints of the
range. Ranges will only behave according to ASCII rules when the
default "C" locale is in effect. For example if the
library is compiled with the Win32 localization model, then [a-z]
will match the ASCII characters a-z, and also 'A', 'B' etc, but
not 'Z' which collates just after 'z'. This locale specific
behaviour can be disabled by specifying regbase::nocollate when
compiling; this is the default behaviour when using regbase::normal,
and forces ranges to collate according to ASCII character code.
Likewise, if you use the POSIX C API functions then setting
REG_NOCOLLATE turns off locale dependent collation. </p>
<p>Character classes are denoted using the syntax "[:classname:]"
within a set declaration, for example "[[:space:]]" is
the set of all whitespace characters. Character classes are only
available if the flag regbase::char_classes is set. The available
character classes are: <br>
</p>
<table border="0" cellpadding="7" cellspacing="0" width="100%">
<tr>
<td width="5%"> </td>
<td valign="top" width="50%">alnum</td>
<td valign="top" width="50%">Any alpha numeric character.</td>
<td width="5%"> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">alpha</td>
<td valign="top" width="50%">Any alphabetical character a-z
and A-Z. Other characters may also be included depending
upon the locale.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">blank</td>
<td valign="top" width="50%">Any blank character, either
a space or a tab.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">cntrl</td>
<td valign="top" width="50%">Any control character.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">digit</td>
<td valign="top" width="50%">Any digit 0-9.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">graph</td>
<td valign="top" width="50%">Any graphical character.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">lower</td>
<td valign="top" width="50%">Any lower case character a-z.
Other characters may also be included depending upon the
locale.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">print</td>
<td valign="top" width="50%">Any printable character.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">punct</td>
<td valign="top" width="50%">Any punctuation character.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">space</td>
<td valign="top" width="50%">Any whitespace character.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">upper</td>
<td valign="top" width="50%">Any upper case character A-Z.
Other characters may also be included depending upon the
locale.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">xdigit</td>
<td valign="top" width="50%">Any hexadecimal digit
character, 0-9, a-f and A-F.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">word</td>
<td valign="top" width="50%">Any word character - all
alphanumeric characters plus the underscore.</td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="50%">unicode</td>
<td valign="top" width="50%">Any character whose code is
greater than 255, this applies to the wide character
traits classes only.</td>
<td> </td>
</tr>
</table>
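<p>The POSIX class names in the table can be exercised with a sketch
like the one below. It assumes std::regex from the C++ standard
library as a stand-in; that library recognises the standard POSIX
names, while "word" and "unicode" are specific to Regex++ and are not
used here. The helper name all_in_class is hypothetical. </p>

```cpp
#include <regex>
#include <string>

// True when `text` is non-empty and every character in it belongs
// to the character class `class_name` (for example "digit").
// std::regex is a stand-in for Regex++ (an assumption).
bool all_in_class(const std::string& class_name, const std::string& text)
{
    // Builds an expression of the form "[[:classname:]]+".
    const std::regex re("[[:" + class_name + ":]]+");
    return std::regex_match(text, re);
}
```

<p>all_in_class("digit", "2001") is true, while
all_in_class("digit", "20a1") and all_in_class("upper", "Abc") are
false. </p>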
<p>Some shortcuts can be used in place of the
character classes, provided the flag regbase::escape_in_lists is
set: </p>
<p>\w in place of [:word:] </p>
<p>\s in place of [:space:] </p>
<p>\d in place of [:digit:] </p>
<p>\l in place of [:lower:] </p>
<p>\u in place of [:upper:] <br>
<br>
</p>
<p>Collating elements take the general form [.tagname.] inside a
set declaration, where <i>tagname</i> is either a single
character, or a name of a collating element. For example [[.a.]]
is equivalent to [a], and [[.comma.]] is equivalent to [,]. The
library supports all the standard POSIX collating element names,
and in addition the following digraphs: "ae", "ch",
"ll", "ss", "nj", "dz",
"lj", each in lower, upper and title case variations.
Multi-character collating elements can result in the set matching
more than one character, for example [[.ae.]] would match two
characters, but note that [^[.ae.]] would only match one
character. <br>
<br>
</p>
<p>Equivalence classes take the general form [=tagname=] inside a
set declaration, where <i>tagname</i> is either a single
character, or a name of a collating element, and matches any
character that is a member of the same primary equivalence class
as the collating element [.tagname.]. An equivalence class is a
set of characters that collate the same; a primary equivalence
class is a set of characters whose primary sort keys are all the
same (for example strings are typically collated by character,
then by accent, and then by case; the primary sort key then
relates to the character, the secondary to the accentuation, and
the tertiary to the case). If there is no equivalence class
corresponding to <i>tagname</i>, then [=tagname=] is exactly the
same as [.tagname.]. Unfortunately there is no locale independent
method of obtaining the primary sort key for a character, except
under Win32. For other operating systems the library will "guess"
the primary sort key from the full sort key (obtained from <i>strxfrm</i>),
so equivalence classes are probably best considered broken under
any operating system other than Win32. <br>
<br>
</p>
<p>To include a literal "-" in a set declaration, either
make it the first character after the opening "[" or
"[^", make it the endpoint of a range or a collating element, or,
if the flag regbase::escape_in_lists is set, precede it with an
escape character as in "[\-]". To include a literal
"[" or "]" or "^" in a set,
make it the endpoint of a range or a collating element, or
precede it with an escape character if the flag regbase::escape_in_lists
is set. <br>
<br>
</p>
<p><i>Line anchors</i> </p>
<p>An anchor is something that matches the null string at the
start or end of a line: "^" matches the null string at
the start of a line, "$" matches the null string at the
end of a line. <br>
<br>
</p>
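<p>The effect of the two anchors can be sketched with a search over a
single-line input, where the start and end of the line coincide with
the start and end of the sequence. The code assumes std::regex as a
stand-in for Regex++, and found_in is a hypothetical helper name. </p>

```cpp
#include <regex>
#include <string>

// True when `pattern` matches somewhere inside `text`.
// std::regex is a stand-in for Regex++ (an assumption).
bool found_in(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    return std::regex_search(text, re);
}
```

<p>With the text "hello world", "^hello" and "world$" are found, but
"^world" and "hello$" are not. </p>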
<p><i>Back references</i> </p>
<p>A back reference is a reference to a previous sub-expression
that has already been matched; the reference is to what the sub-expression
matched, not to the expression itself. A back reference consists
of the escape character "\" followed by a digit "1"
to "9": "\1" refers to the first sub-expression,
"\2" to the second etc. For example the expression
"(.*)\1" matches any string that is repeated about its
mid-point, for example "abcabc" or "xyzxyz". A
back reference to a sub-expression that did not participate in
any match matches the null string; note that this differs from some
other regular expression matchers. Back references are only
available if the expression is compiled with the flag regbase::bk_refs
set. <br>
<br>
</p>
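<p>The "(.*)\1" example can be sketched as follows, assuming
std::regex from the C++ standard library as a stand-in for Regex++;
repeated_half is a hypothetical name. The helper returns what the
first sub-expression matched, which is the half that the back
reference then has to repeat. </p>

```cpp
#include <regex>
#include <string>

// Returns the half captured by sub-expression 1 of "(.*)\1" when the
// whole of `text` is some string repeated twice, otherwise "".
// std::regex is a stand-in for Regex++ (an assumption).
std::string repeated_half(const std::string& text)
{
    static const std::regex re("(.*)\\1");
    std::smatch what;
    if (std::regex_match(text, what, re))
        return what[1].str();
    return std::string();
}
```

<p>repeated_half("abcabc") gives "abc", while repeated_half("abcabd")
gives an empty string because the second half does not repeat the
first. </p>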
<p><i>Characters by code</i> </p>
<p>This is an extension to the algorithm that is not available in
other libraries; it consists of the escape character followed by
the digit "0" followed by the octal character code. For
example "\023" represents the character whose octal
code is 23. Where ambiguity could occur use parentheses to break
the expression up: "\0103" represents the character
whose code is 103, "(\010)3" represents the character 10
followed by "3". To match characters by their
hexadecimal code, use \x followed by a string of hexadecimal
digits, optionally enclosed inside {}, for example \xf0 or
\x{aff}; note that the latter example is a Unicode character. <br>
<br>
</p>
<p><i>Word operators</i> </p>
<p>The following operators are provided for compatibility with
the GNU regular expression library. </p>
<p>"\w" matches any single character that is a member
of the "word" character class; this is identical to the
expression "[[:word:]]". </p>
<p>"\W" matches any single character that is not a
member of the "word" character class; this is identical
to the expression "[^[:word:]]". </p>
<p>"\&lt;" matches the null string at the start of a
word. </p>
<p>"\&gt;" matches the null string at the end of a
word. </p>
<p>"\b" matches the null string at either the start or
the end of a word. </p>
<p>"\B" matches a null string within a word. </p>
<p>The start of the sequence passed to the matching algorithms is
considered to be a potential start of a word unless the flag
match_not_bow is set. The end of the sequence passed to the
matching algorithms is considered to be a potential end of a word
unless the flag match_not_eow is set. <br>
<br>
</p>
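<p>As an illustration of the "\b" operator, the sketch below finds the
first whole-word occurrence of a word. It assumes std::regex from the
C++ standard library, which supports "\b" and "\B" but not the GNU
start-of-word and end-of-word operators, so it is only a partial
stand-in for Regex++. The name whole_word_offset is hypothetical. </p>

```cpp
#include <regex>
#include <string>

// Offset of the first whole-word occurrence of `word` in `text`,
// or -1 when there is none. `word` is assumed to consist of word
// characters only, so it is inserted into the expression unescaped.
// std::regex is a stand-in for Regex++ (an assumption).
long whole_word_offset(const std::string& word, const std::string& text)
{
    const std::regex re("\\b" + word + "\\b");
    std::smatch what;
    if (std::regex_search(text, what, re))
        return static_cast<long>(what.position(0));
    return -1;
}
```

<p>In "concat cat" the word "cat" is found at offset 7, not at offset
3 where it is only the tail of "concat"; in "concatenate" it is not
found at all. </p>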
<p><i>Buffer operators</i> </p>
<p>The following operators are provided for compatibility with the
GNU regular expression library, and Perl regular expressions: </p>
<p>"\`" matches the start of a buffer. </p>
<p>"\A" matches the start of a buffer. </p>
<p>"\'" matches the end of a buffer. </p>
<p>"\z" matches the end of a buffer. </p>
<p>"\Z" matches the end of a buffer, or possibly one or
more new line characters followed by the end of the buffer. </p>
<p>A buffer is considered to consist of the whole sequence passed
to the matching algorithms, unless the flags match_not_bob or
match_not_eob are set. <br>
<br>
</p>
<p><i>Escape operator</i> </p>
<p>The escape character "\" has several meanings. </p>
<p>Inside a set declaration the escape character is a normal
character unless the flag regbase::escape_in_lists is set, in
which case whatever follows the escape is a literal character
regardless of its normal meaning. </p>
<p>The escape operator may introduce an operator, for example
a back reference or a word operator. </p>
<p>The escape operator may make the following character normal,
for example "\*" represents a literal "*"
rather than the repeat operator. <br>
<br>
</p>
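<p>The last use of the escape character, turning an operator into a
literal, can be sketched as below. The code assumes std::regex as a
stand-in for Regex++; both helper names are hypothetical, and the
list of special characters is the one from the Literals section plus
"\" and "|". </p>

```cpp
#include <regex>
#include <string>

// Precedes each special character in `text` with a backslash so
// that the result, used as an expression, matches `text` literally.
std::string escape_literal(const std::string& text)
{
    static const std::string special = ".*?+(){}[]^$\\|";
    std::string out;
    for (char c : text) {
        if (special.find(c) != std::string::npos)
            out += '\\';
        out += c;
    }
    return out;
}

// True when the whole of `text` equals `literal`, checked by way of
// an escaped expression. std::regex is a stand-in for Regex++.
bool matches_literal_text(const std::string& literal,
                          const std::string& text)
{
    const std::regex re(escape_literal(literal));
    return std::regex_match(text, re);
}
```

<p>escape_literal("a*b") produces the expression a\*b, which matches
the text "a*b" but not "aab". </p>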
<p><i>Single character escape sequences</i> </p>
<p>The following escape sequences are aliases for single
characters: <br>
</p>
<table border="0" cellpadding="7" cellspacing="0" width="100%">
<tr>
<td width="5%"> </td>
<td valign="top" width="33%">Escape sequence </td>
<td valign="top" width="33%">Character code </td>
<td valign="top" width="33%">Meaning </td>
<td width="5%"> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\a </td>
<td valign="top" width="33%">0x07 </td>
<td valign="top" width="33%">Bell character. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\f </td>
<td valign="top" width="33%">0x0C </td>
<td valign="top" width="33%">Form feed. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\n </td>
<td valign="top" width="33%">0x0A </td>
<td valign="top" width="33%">Newline character. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\r </td>
<td valign="top" width="33%">0x0D </td>
<td valign="top" width="33%">Carriage return. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\t </td>
<td valign="top" width="33%">0x09 </td>
<td valign="top" width="33%">Tab character. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\v </td>
<td valign="top" width="33%">0x0B </td>
<td valign="top" width="33%">Vertical tab. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\e </td>
<td valign="top" width="33%">0x1B </td>
<td valign="top" width="33%">ASCII Escape character. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\0dd </td>
<td valign="top" width="33%">0dd </td>
<td valign="top" width="33%">An octal character code,
where <i>dd</i> is one or more octal digits. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\xXX </td>
<td valign="top" width="33%">0xXX </td>
<td valign="top" width="33%">A hexadecimal character
code, where XX is one or more hexadecimal digits. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\x{XX} </td>
<td valign="top" width="33%">0xXX </td>
<td valign="top" width="33%">A hexadecimal character
code, where XX is one or more hexadecimal digits,
optionally a Unicode character. </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td valign="top" width="33%">\cZ </td>
<td valign="top" width="33%">z-@ </td>
<td valign="top" width="33%">An ASCII escape sequence
control-Z, where Z is any ASCII character greater than or
equal to the character code for '@'. </td>
<td> </td>
</tr>
</table>
<p><br>
</p>
<p><i>Miscellaneous escape sequences:</i> </p>
<p>The following are provided mostly for Perl compatibility, but
note that there are some differences in the meanings of \l \L \u
and \U: <br>
</p>
<table border="0" cellpadding="6" cellspacing="0" width="100%">
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\w </td>
<td valign="top" width="45%">Equivalent to [[:word:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\W </td>
<td valign="top" width="45%">Equivalent to [^[:word:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\s </td>
<td valign="top" width="45%">Equivalent to [[:space:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\S </td>
<td valign="top" width="45%">Equivalent to [^[:space:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\d </td>
<td valign="top" width="45%">Equivalent to [[:digit:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\D </td>
<td valign="top" width="45%">Equivalent to [^[:digit:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\l </td>
<td valign="top" width="45%">Equivalent to [[:lower:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\L </td>
<td valign="top" width="45%">Equivalent to [^[:lower:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\u </td>
<td valign="top" width="45%">Equivalent to [[:upper:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\U </td>
<td valign="top" width="45%">Equivalent to [^[:upper:]]. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\C </td>
<td valign="top" width="45%">Any single character,
equivalent to '.'. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\X </td>
<td valign="top" width="45%">Match any Unicode combining
character sequence, for example "a\x 0301" (a
letter a with an acute). </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\Q </td>
<td valign="top" width="45%">The begin quote operator,
everything that follows is treated as a literal character
until a \E end quote operator is found. </td>
<td width="5%"> </td>
</tr>
<tr>
<td width="5%"> </td>
<td valign="top" width="45%">\E </td>
<td valign="top" width="45%">The end quote operator,
terminates a sequence begun with \Q. </td>
<td width="5%"> </td>
</tr>
</table>
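<p>The class shortcuts and their negations can be illustrated by
deleting every character that a given escape matches. The sketch
assumes std::regex from the C++ standard library, which supports \w,
\W, \s, \S, \d and \D but not the other escapes in the table, so it
is only a partial stand-in for Regex++. The name remove_matching is
hypothetical. </p>

```cpp
#include <regex>
#include <string>

// Returns `text` with every character matched by the
// single-character escape `escape` (for example "\\D") removed.
// std::regex is a stand-in for Regex++ (an assumption).
std::string remove_matching(const std::string& escape,
                            const std::string& text)
{
    const std::regex re(escape);
    return std::regex_replace(text, re, "");
}
```

<p>Removing "\D" from "v3.20, 2001" leaves "3202001", and removing
"\W" from "reg_ex++!" leaves "reg_ex". </p>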
<p><br>
</p>
<p><i>What gets matched?</i> </p>
<p>The regular expression library will match the first possible
matching string. If more than one string starting at a given
location can match, then it matches the longest possible string,
unless the flag match_any is set, in which case the first match
encountered is returned. Use of the match_any option can reduce
the time taken to find the match, but is only useful if the user
is less concerned about what matched; for example it would not
be suitable for search and replace operations. In cases where
there are multiple possible matches all starting at the same
location, and all of the same length, the match chosen is
the one with the longest first sub-expression; if that is the
same for two or more matches, then the second sub-expression will
be examined, and so on. <br>
</p>
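<p>The leftmost, longest rule can be contrasted with a
first-alternative-wins rule using the sketch below. It does not use
Regex++: it assumes std::regex from the C++ standard library, whose
POSIX extended grammar is specified to pick the longest match at the
leftmost position, while its default ECMAScript grammar takes the
first alternative that succeeds. The helper name first_match is
hypothetical. </p>

```cpp
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `text` when
// the expression is compiled with the given grammar flag, or "".
// std::regex is a stand-in for Regex++ (an assumption).
std::string first_match(const std::string& pattern,
                        const std::string& text,
                        std::regex::flag_type grammar)
{
    const std::regex re(pattern, grammar);
    std::smatch what;
    if (std::regex_search(text, what, re))
        return what.str();
    return std::string();
}
```

<p>For the expression "a|ab" and the text "xab", the extended grammar
reports "ab" (the longest string at the first matching location),
whereas the ECMAScript grammar reports "a". </p>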
<hr>
<p><i>Copyright </i><a href="mailto:John_Maddock@compuserve.com"><i>Dr
John Maddock</i></a><i> 1998-2000 all rights reserved.</i> </p>
</body>
</html>