2003-05-17 11:55:51 +00:00
|
|
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
|
|
|
|
<html>
|
|
|
|
|
<head>
|
|
|
|
|
<title>Boost.Regex: Regular Expression Syntax</title>
|
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
|
|
|
<link rel="stylesheet" type="text/css" href="../../../boost.css">
|
|
|
|
|
</head>
|
|
|
|
|
<body>
|
|
|
|
|
<P>
|
|
|
|
|
<TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
|
|
|
|
|
<TR>
|
|
|
|
|
<td valign="top" width="300">
|
|
|
|
|
<h3><a href="../../../index.htm"><img height="86" width="277" alt="C++ Boost" src="../../../c++boost.gif" border="0"></a></h3>
|
|
|
|
|
</td>
|
|
|
|
|
<TD width="353">
|
|
|
|
|
<H1 align="center">Boost.Regex</H1>
|
|
|
|
|
<H2 align="center">Regular Expression Syntax</H2>
|
|
|
|
|
</TD>
|
|
|
|
|
<td width="50">
|
|
|
|
|
<h3><a href="index.html"><img height="45" width="43" alt="Boost.Regex Index" src="uarrow.gif" border="0"></a></h3>
|
|
|
|
|
</td>
|
|
|
|
|
</TR>
|
|
|
|
|
</TABLE>
|
|
|
|
|
</P>
|
|
|
|
|
<HR>
|
|
|
|
|
<P>This section covers the regular expression syntax used by this library, this is
|
|
|
|
|
a programmers guide, the actual syntax presented to your program's users will
|
|
|
|
|
depend upon the flags used during expression compilation.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Literals
|
|
|
|
|
</H3>
|
|
|
|
|
<P>All characters are literals except: ".", "|", "*", "?", "+", "(", ")", "{",
|
|
|
|
|
"}", "[", "]", "^", "$" and "\". These characters are literals when preceded by
|
|
|
|
|
a "\". A literal is a character that matches itself, or matches the result of
|
|
|
|
|
traits_type::translate(), where traits_type is the traits template parameter to
|
|
|
|
|
class basic_regex.</P>
|
|
|
|
|
<H3>Wildcard
|
|
|
|
|
</H3>
|
|
|
|
|
<P>The dot character "." matches any single character except : when <I>match_not_dot_null</I>
|
|
|
|
|
is passed to the matching algorithms, the dot does not match a null character;
|
|
|
|
|
when <I>match_not_dot_newline</I> is passed to the matching algorithms, then
|
|
|
|
|
the dot does not match a newline character.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Repeats
|
|
|
|
|
</H3>
|
|
|
|
|
<P>A repeat is an expression that is repeated an arbitrary number of times. An
|
|
|
|
|
expression followed by "*" can be repeated any number of times including zero.
|
|
|
|
|
An expression followed by "+" can be repeated any number of times, but at least
|
|
|
|
|
once, if the expression is compiled with the flag regex_constants::bk_plus_qm
|
|
|
|
|
then "+" is an ordinary character and "\+" represents a repeat of once or more.
|
|
|
|
|
An expression followed by "?" may be repeated zero or one times only, if the
|
|
|
|
|
expression is compiled with the flag regex_constants::bk_plus_qm then "?" is an
|
|
|
|
|
ordinary character and "\?" represents the repeat zero or once operator. When
|
|
|
|
|
it is necessary to specify the minimum and maximum number of repeats
|
|
|
|
|
explicitly, the bounds operator "{}" may be used, thus "a{2}" is the letter "a"
|
|
|
|
|
repeated exactly twice, "a{2,4}" represents the letter "a" repeated between 2
|
|
|
|
|
and 4 times, and "a{2,}" represents the letter "a" repeated at least twice with
|
|
|
|
|
no upper limit. Note that there must be no white-space inside the {}, and there
|
|
|
|
|
is no upper limit on the values of the lower and upper bounds. When the
|
|
|
|
|
expression is compiled with the flag regex_constants::bk_braces then "{" and
|
|
|
|
|
"}" are ordinary characters and "\{" and "\}" are used to delimit bounds
|
|
|
|
|
instead. All repeat expressions refer to the shortest possible previous
|
|
|
|
|
sub-expression: a single character; a character set, or a sub-expression
|
|
|
|
|
grouped with "()" for example.
|
|
|
|
|
</P>
|
|
|
|
|
<P>Examples:
|
|
|
|
|
</P>
|
|
|
|
|
<P>"ba*" will match all of "b", "ba", "baaa" etc.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"ba+" will match "ba" or "baaaa" for example but not "b".
|
|
|
|
|
</P>
|
|
|
|
|
<P>"ba?" will match "b" or "ba".
|
|
|
|
|
</P>
|
|
|
|
|
<P>"ba{2,4}" will match "baa", "baaa" and "baaaa".
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Non-greedy repeats
|
|
|
|
|
</H3>
|
|
|
|
|
<P>Whenever the "extended" regular expression syntax is in use (the default) then
|
|
|
|
|
non-greedy repeats are possible by appending a '?' after the repeat; a
|
|
|
|
|
non-greedy repeat is one which will match the <I>shortest</I> possible string.
|
|
|
|
|
</P>
|
|
|
|
|
<P>For example to match html tag pairs one could use something like:
|
|
|
|
|
</P>
|
|
|
|
|
<P>"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>"
|
|
|
|
|
</P>
|
|
|
|
|
<P>In this case $1 will contain the text between the tag pairs, and will be the
|
|
|
|
|
shortest possible matching string.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Parenthesis
|
|
|
|
|
</H3>
|
|
|
|
|
<P>Parentheses serve two purposes, to group items together into a sub-expression,
|
|
|
|
|
and to mark what generated the match. For example the expression "(ab)*" would
|
2003-05-24 11:13:26 +00:00
|
|
|
|
match all of the string "ababab". The matching algorithms <A href="regex_match.html">
|
|
|
|
|
regex_match</A> and <A href="regex_search.html">regex_search</A>
|
|
|
|
|
each take an instance of <A href="match_results.html">match_results</A>
|
|
|
|
|
that reports what caused the match, on exit from these functions the <A href="match_results.html">
|
2003-05-17 11:55:51 +00:00
|
|
|
|
match_results</A> contains information both on what the whole expression
|
|
|
|
|
matched and on what each sub-expression matched. In the example above
|
|
|
|
|
match_results[1] would contain a pair of iterators denoting the final "ab" of
|
|
|
|
|
the matching string. It is permissible for sub-expressions to match null
|
|
|
|
|
strings. If a sub-expression takes no part in a match - for example if it is
|
|
|
|
|
part of an alternative that is not taken - then both of the iterators that are
|
|
|
|
|
returned for that sub-expression point to the end of the input string, and the <I>matched</I>
|
|
|
|
|
parameter for that sub-expression is <I>false</I>. Sub-expressions are indexed
|
|
|
|
|
from left to right starting from 1, sub-expression 0 is the whole expression.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Non-Marking Parenthesis
|
|
|
|
|
</H3>
|
|
|
|
|
<P>Sometimes you need to group sub-expressions with parenthesis, but don't want
|
|
|
|
|
the parenthesis to spit out another marked sub-expression, in this case a
|
|
|
|
|
non-marking parenthesis (?:expression) can be used. For example the following
|
|
|
|
|
expression creates no sub-expressions:
|
|
|
|
|
</P>
|
|
|
|
|
<P>"(?:abc)*"</P>
|
|
|
|
|
<H3>Forward Lookahead Asserts
|
|
|
|
|
</H3>
|
|
|
|
|
<P>There are two forms of these; one for positive forward lookahead asserts, and
|
|
|
|
|
one for negative lookahead asserts:</P>
|
|
|
|
|
<P>"(?=abc)" matches zero characters only if they are followed by the expression
|
|
|
|
|
"abc".</P>
|
|
|
|
|
<P>"(?!abc)" matches zero characters only if they are not followed by the
|
|
|
|
|
expression "abc".</P>
|
|
|
|
|
<H3>Independent sub-expressions</H3>
|
|
|
|
|
<P>"(?>expression)" matches "expression" as an independent atom (the algorithm
|
|
|
|
|
will not backtrack into it if a failure occurs later in the expression).</P>
|
|
|
|
|
<H3>Alternatives
|
|
|
|
|
</H3>
|
|
|
|
|
<P>Alternatives occur when the expression can match either one sub-expression or
|
|
|
|
|
another, each alternative is separated by a "|", or a "\|" if the flag
|
|
|
|
|
regex_constants::bk_vbar is set, or by a newline character if the flag
|
|
|
|
|
regex_constants::newline_alt is set. Each alternative is the largest possible
|
|
|
|
|
previous sub-expression; this is the opposite behavior from repetition
|
|
|
|
|
operators.
|
|
|
|
|
</P>
|
|
|
|
|
<P>Examples:
|
|
|
|
|
</P>
|
|
|
|
|
<P>"a(b|c)" could match "ab" or "ac".
|
|
|
|
|
</P>
|
|
|
|
|
<P>"abc|def" could match "abc" or "def".
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Sets
|
|
|
|
|
</H3>
|
|
|
|
|
<P>A set is a set of characters that can match any single character that is a
|
|
|
|
|
member of the set. Sets are delimited by "[" and "]" and can contain literals,
|
|
|
|
|
character ranges, character classes, collating elements and equivalence
|
|
|
|
|
classes. Set declarations that start with "^" contain the compliment of the
|
|
|
|
|
elements that follow.
|
|
|
|
|
</P>
|
|
|
|
|
<P>Examples:
|
|
|
|
|
</P>
|
|
|
|
|
<P>Character literals:
|
|
|
|
|
</P>
|
|
|
|
|
<P>"[abc]" will match either of "a", "b", or "c".
|
|
|
|
|
</P>
|
|
|
|
|
<P>"[^abc] will match any character other than "a", "b", or "c".
|
|
|
|
|
</P>
|
|
|
|
|
<P>Character ranges:
|
|
|
|
|
</P>
|
|
|
|
|
<P>"[a-z]" will match any character in the range "a" to "z".
|
|
|
|
|
</P>
|
|
|
|
|
<P>"[^A-Z]" will match any character other than those in the range "A" to "Z".
|
|
|
|
|
</P>
|
|
|
|
|
<P>Note that character ranges are highly locale dependent if the flag
|
|
|
|
|
regex_constants::collate is set: they match any character that collates between
|
|
|
|
|
the endpoints of the range, ranges will only behave according to ASCII rules
|
|
|
|
|
when the default "C" locale is in effect. For example if the library is
|
|
|
|
|
compiled with the Win32 localization model, then [a-z] will match the ASCII
|
|
|
|
|
characters a-z, and also 'A', 'B' etc, but not 'Z' which collates just after
|
|
|
|
|
'z'. This locale specific behavior is disabled by default (in perl mode), and
|
|
|
|
|
forces ranges to collate according to ASCII character code.
|
|
|
|
|
</P>
|
|
|
|
|
<P>Character classes are denoted using the syntax "[:classname:]" within a set
|
|
|
|
|
declaration, for example "[[:space:]]" is the set of all whitespace characters.
|
|
|
|
|
Character classes are only available if the flag regex_constants::char_classes
|
|
|
|
|
is set. The available character classes are:
|
|
|
|
|
<BR>
|
|
|
|
|
|
|
|
|
|
</P>
|
|
|
|
|
<P>
|
|
|
|
|
<TABLE id="Table2" cellSpacing="0" cellPadding="7" width="100%" border="0">
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">alnum</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any alpha numeric character.</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">alpha</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any alphabetical character a-z and A-Z. Other
|
|
|
|
|
characters may also be included depending upon the locale.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">blank</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any blank character, either a space or a tab.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">cntrl</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any control character.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">digit</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any digit 0-9.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">graph</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any graphical character.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">lower</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any lower case character a-z. Other characters may
|
|
|
|
|
also be included depending upon the locale.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">print</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any printable character.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">punct</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any punctuation character.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">space</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any whitespace character.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">upper</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any upper case character A-Z. Other characters may
|
|
|
|
|
also be included depending upon the locale.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">xdigit</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any hexadecimal digit character, 0-9, a-f and A-F.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">word</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any word character - all alphanumeric characters plus
|
|
|
|
|
the underscore.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Unicode</TD>
|
|
|
|
|
<TD vAlign="top" width="50%">Any character whose code is greater than 255, this
|
|
|
|
|
applies to the wide character traits classes only.</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
</TABLE>
|
|
|
|
|
</P>
|
|
|
|
|
<P>There are some shortcuts that can be used in place of the character classes,
|
|
|
|
|
provided the flag regex_constants::escape_in_lists is set then you can use:
|
|
|
|
|
</P>
|
|
|
|
|
<P>\w in place of [:word:]
|
|
|
|
|
</P>
|
|
|
|
|
<P>\s in place of [:space:]
|
|
|
|
|
</P>
|
|
|
|
|
<P>\d in place of [:digit:]
|
|
|
|
|
</P>
|
|
|
|
|
<P>\l in place of [:lower:]
|
|
|
|
|
</P>
|
|
|
|
|
<P>\u in place of [:upper:]
|
|
|
|
|
</P>
|
|
|
|
|
<P>Collating elements take the general form [.tagname.] inside a set declaration,
|
|
|
|
|
where <I>tagname</I> is either a single character, or a name of a collating
|
|
|
|
|
element, for example [[.a.]] is equivalent to [a], and [[.comma.]] is
|
|
|
|
|
equivalent to [,]. The library supports all the standard POSIX collating
|
|
|
|
|
element names, and in addition the following digraphs: "ae", "ch", "ll", "ss",
|
|
|
|
|
"nj", "dz", "lj", each in lower, upper and title case variations.
|
|
|
|
|
Multi-character collating elements can result in the set matching more than one
|
|
|
|
|
character, for example [[.ae.]] would match two characters, but note that
|
|
|
|
|
[^[.ae.]] would only match one character.
|
|
|
|
|
</P>
|
|
|
|
|
<P>
|
|
|
|
|
Equivalence classes take the general form[=tagname=] inside a set declaration,
|
|
|
|
|
where <I>tagname</I> is either a single character, or a name of a collating
|
|
|
|
|
element, and matches any character that is a member of the same primary
|
|
|
|
|
equivalence class as the collating element [.tagname.]. An equivalence class is
|
|
|
|
|
a set of characters that collate the same, a primary equivalence class is a set
|
|
|
|
|
of characters whose primary sort key are all the same (for example strings are
|
|
|
|
|
typically collated by character, then by accent, and then by case; the primary
|
|
|
|
|
sort key then relates to the character, the secondary to the accentation, and
|
|
|
|
|
the tertiary to the case). If there is no equivalence class corresponding to <I>tagname</I>
|
|
|
|
|
, then[=tagname=] is exactly the same as [.tagname.]. Unfortunately there is no
|
|
|
|
|
locale independent method of obtaining the primary sort key for a character,
|
|
|
|
|
except under Win32. For other operating systems the library will "guess" the
|
|
|
|
|
primary sort key from the full sort key (obtained from <I>strxfrm</I>), so
|
|
|
|
|
equivalence classes are probably best considered broken under any operating
|
|
|
|
|
system other than Win32.
|
|
|
|
|
</P>
|
|
|
|
|
<P>To include a literal "-" in a set declaration then: make it the first character
|
|
|
|
|
after the opening "[" or "[^", the endpoint of a range, a collating element, or
|
|
|
|
|
if the flag regex_constants::escape_in_lists is set then precede with an escape
|
|
|
|
|
character as in "[\-]". To include a literal "[" or "]" or "^" in a set then
|
|
|
|
|
make them the endpoint of a range, a collating element, or precede with an
|
|
|
|
|
escape character if the flag regex_constants::escape_in_lists is set.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Line anchors
|
|
|
|
|
</H3>
|
|
|
|
|
<P>An anchor is something that matches the null string at the start or end of a
|
|
|
|
|
line: "^" matches the null string at the start of a line, "$" matches the null
|
|
|
|
|
string at the end of a line.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Back references
|
|
|
|
|
</H3>
|
|
|
|
|
<P>A back reference is a reference to a previous sub-expression that has already
|
|
|
|
|
been matched, the reference is to what the sub-expression matched, not to the
|
|
|
|
|
expression itself. A back reference consists of the escape character "\"
|
|
|
|
|
followed by a digit "1" to "9", "\1" refers to the first sub-expression, "\2"
|
|
|
|
|
to the second etc. For example the expression "(.*)\1" matches any string that
|
|
|
|
|
is repeated about its mid-point for example "abcabc" or "xyzxyz". A back
|
|
|
|
|
reference to a sub-expression that did not participate in any match, matches
|
|
|
|
|
the null string: NB this is different to some other regular expression
|
|
|
|
|
matchers. Back references are only available if the expression is compiled with
|
|
|
|
|
the flag regex_constants::bk_refs set.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Characters by code
|
|
|
|
|
</H3>
|
|
|
|
|
<P>This is an extension to the algorithm that is not available in other libraries,
|
|
|
|
|
it consists of the escape character followed by the digit "0" followed by the
|
|
|
|
|
octal character code. For example "\023" represents the character whose octal
|
|
|
|
|
code is 23. Where ambiguity could occur use parentheses to break the expression
|
|
|
|
|
up: "\0103" represents the character whose code is 103, "(\010)3 represents the
|
|
|
|
|
character 10 followed by "3". To match characters by their hexadecimal code,
|
|
|
|
|
use \x followed by a string of hexadecimal digits, optionally enclosed inside
|
|
|
|
|
{}, for example \xf0 or \x{aff}, notice the latter example is a Unicode
|
|
|
|
|
character.</P>
|
|
|
|
|
<H3>Word operators
|
|
|
|
|
</H3>
|
|
|
|
|
<P>The following operators are provided for compatibility with the GNU regular
|
|
|
|
|
expression library.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\w" matches any single character that is a member of the "word" character
|
|
|
|
|
class, this is identical to the expression "[[:word:]]".
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\W" matches any single character that is not a member of the "word" character
|
|
|
|
|
class, this is identical to the expression "[^[:word:]]".
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\<" matches the null string at the start of a word.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\>" matches the null string at the end of the word.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\b" matches the null string at either the start or the end of a word.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\B" matches a null string within a word.
|
|
|
|
|
</P>
|
|
|
|
|
<P>The start of the sequence passed to the matching algorithms is considered to be
|
|
|
|
|
a potential start of a word unless the flag match_not_bow is set. The end of
|
|
|
|
|
the sequence passed to the matching algorithms is considered to be a potential
|
|
|
|
|
end of a word unless the flag match_not_eow is set.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Buffer operators
|
|
|
|
|
</H3>
|
|
|
|
|
<P>The following operators are provided for compatibility with the GNU regular
|
|
|
|
|
expression library, and Perl regular expressions:
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\`" matches the start of a buffer.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\A" matches the start of the buffer.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\'" matches the end of a buffer.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\z" matches the end of a buffer.
|
|
|
|
|
</P>
|
|
|
|
|
<P>"\Z" matches the end of a buffer, or possibly one or more new line characters
|
|
|
|
|
followed by the end of the buffer.
|
|
|
|
|
</P>
|
|
|
|
|
<P>A buffer is considered to consist of the whole sequence passed to the matching
|
|
|
|
|
algorithms, unless the flags match_not_bob or match_not_eob are set.
|
|
|
|
|
</P>
|
|
|
|
|
<H3>Escape operator
|
|
|
|
|
</H3>
|
|
|
|
|
<P>The escape character "\" has several meanings.
|
|
|
|
|
</P>
|
|
|
|
|
<P>Inside a set declaration the escape character is a normal character unless the
|
|
|
|
|
flag regex_constants::escape_in_lists is set in which case whatever follows the
|
|
|
|
|
escape is a literal character regardless of its normal meaning.
|
|
|
|
|
</P>
|
|
|
|
|
<P>The escape operator may introduce an operator for example: back references, or
|
|
|
|
|
a word operator.
|
|
|
|
|
</P>
|
|
|
|
|
<P>The escape operator may make the following character normal, for example "\*"
|
|
|
|
|
represents a literal "*" rather than the repeat operator.
|
|
|
|
|
</P>
|
|
|
|
|
<H4>Single character escape sequences
|
|
|
|
|
</H4>
|
|
|
|
|
<P>The following escape sequences are aliases for single characters:
|
|
|
|
|
<BR>
|
|
|
|
|
|
|
|
|
|
</P>
|
|
|
|
|
<P>
|
|
|
|
|
<TABLE id="Table3" cellSpacing="0" cellPadding="7" width="100%" border="0">
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Escape sequence
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Character code
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Meaning
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\a
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0x07
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Bell character.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\f
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0x0C
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Form feed.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\n
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0x0A
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Newline character.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\r
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0x0D
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Carriage return.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\t
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0x09
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Tab character.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\v
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0x0B
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">Vertical tab.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\e
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0x1B
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">ASCII Escape character.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\0dd
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0dd
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">An octal character code, where <I>dd</I> is one or
|
|
|
|
|
more octal digits.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\xXX
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0xXX
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">A hexadecimal character code, where XX is one or more
|
|
|
|
|
hexadecimal digits.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\x{XX}
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">0xXX
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">A hexadecimal character code, where XX is one or more
|
|
|
|
|
hexadecimal digits, optionally a Unicode character.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
<TD vAlign="top" width="33%">\cZ
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">z-@
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="33%">An ASCII escape sequence control-Z, where Z is any
|
|
|
|
|
ASCII character greater than or equal to the character code for '@'.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
</TABLE>
|
|
|
|
|
</P>
|
|
|
|
|
<H4>Miscellaneous escape sequences:
|
|
|
|
|
</H4>
|
|
|
|
|
<P>The following are provided mostly for perl compatibility, but note that there
|
|
|
|
|
are some differences in the meanings of \l \L \u and \U:
|
|
|
|
|
<BR>
|
|
|
|
|
|
|
|
|
|
</P>
|
|
|
|
|
<P>
|
|
|
|
|
<TABLE id="Table4" cellSpacing="0" cellPadding="6" width="100%" border="0">
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\w
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [[:word:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\W
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [^[:word:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\s
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [[:space:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\S
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [^[:space:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\d
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [[:digit:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\D
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [^[:digit:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\l
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [[:lower:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\L
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [^[:lower:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\u
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [[:upper:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\U
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Equivalent to [^[:upper:]].
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\C
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Any single character, equivalent to '.'.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\X
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">Match any Unicode combining character sequence, for
|
|
|
|
|
example "a\x 0301" (a letter a with an acute).
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\Q
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">The begin quote operator, everything that follows is
|
|
|
|
|
treated as a literal character until a \E end quote operator is found.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
<TD vAlign="top" width="45%">\E
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="45%">The end quote operator, terminates a sequence begun
|
|
|
|
|
with \Q.
|
|
|
|
|
</TD>
|
|
|
|
|
<TD width="5%"> </TD>
|
|
|
|
|
</TR>
|
|
|
|
|
</TABLE>
|
|
|
|
|
</P>
|
|
|
|
|
<H3>What gets matched?
|
|
|
|
|
</H3>
|
|
|
|
|
<P>
|
|
|
|
|
When the expression is compiled as a Perl-compatible regex then the matching
|
|
|
|
|
algorithms will perform a depth first search on the state machine and report
|
|
|
|
|
the first match found.</P>
|
|
|
|
|
<P>
|
|
|
|
|
When the expression is compiled as a POSIX-compatible regex then the matching
|
|
|
|
|
algorithms will match the first possible matching string, if more than one
|
|
|
|
|
string starting at a given location can match then it matches the longest
|
|
|
|
|
possible string, unless the flag match_any is set, in which case the first
|
|
|
|
|
match encountered is returned. Use of the match_any option can reduce the time
|
|
|
|
|
taken to find the match - but is only useful if the user is less concerned
|
|
|
|
|
about what matched - for example it would not be suitable for search and
|
|
|
|
|
replace operations. In cases where their are multiple possible matches all
|
|
|
|
|
starting at the same location, and all of the same length, then the match
|
|
|
|
|
chosen is the one with the longest first sub-expression, if that is the same
|
|
|
|
|
for two or more matches, then the second sub-expression will be examined and so
|
|
|
|
|
on.
|
|
|
|
|
</P><P>
|
|
|
|
|
The following table examples illustrate the main differences between Perl and
|
|
|
|
|
POSIX regular expression matching rules:
|
|
|
|
|
</P>
|
|
|
|
|
<P>
|
|
|
|
|
<TABLE id="Table5" cellSpacing="1" cellPadding="7" width="624" border="1">
|
|
|
|
|
<TBODY>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P>Expression</P>
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P>Text</P>
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P>POSIX leftmost longest match</P>
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P>ECMAScript depth first search match</P>
|
|
|
|
|
</TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>a|ab</CODE></P>
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
xaby</CODE>
|
|
|
|
|
</P>
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
"ab"</CODE></P></TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
"a"</CODE></P></TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
.*([[:alnum:]]+).*</CODE></P></TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
" abc def xyz "</CODE></P></TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P>$0 = " abc def xyz "<BR>
|
|
|
|
|
$1 = "abc"</P>
|
|
|
|
|
</TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P>$0 = " abc def xyz "<BR>
|
|
|
|
|
$1 = "z"</P>
|
|
|
|
|
</TD>
|
|
|
|
|
</TR>
|
|
|
|
|
<TR>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
.*(a|xayy)</CODE></P></TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
zzxayyzz</CODE></P></TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>
|
|
|
|
|
"zzxayy"</CODE></P></TD>
|
|
|
|
|
<TD vAlign="top" width="25%">
|
|
|
|
|
<P><CODE>"zzxa"</CODE></P>
|
|
|
|
|
</TD>
|
|
|
|
|
</TR>
|
|
|
|
|
</TBODY></CODE></TD></TR></TABLE>
|
|
|
|
|
<P>These differences between Perl matching rules, and POSIX matching rules, mean
|
|
|
|
|
that these two regular expression syntaxes differ not only in the features
|
|
|
|
|
offered, but also in the form that the state machine takes and/or the
|
|
|
|
|
algorithms used to traverse the state machine.</p>
|
|
|
|
|
<HR>
|
|
|
|
|
<p>Revised
|
|
|
|
|
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->
|
|
|
|
|
17 May 2003
|
|
|
|
|
<!--webbot bot="Timestamp" endspan i-checksum="39359" -->
|
|
|
|
|
</p>
|
|
|
|
|
<P><I><EFBFBD> Copyright <a href="mailto:jm@regex.fsnet.co.uk">John Maddock</a> 1998-<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%Y" startspan --> 2003<!--webbot bot="Timestamp" endspan i-checksum="39359" --></I></P>
|
|
|
|
|
<P align="left"><I>Permission to use, copy, modify, distribute and sell this software
|
|
|
|
|
and its documentation for any purpose is hereby granted without fee, provided
|
|
|
|
|
that the above copyright notice appear in all copies and that both that
|
|
|
|
|
copyright notice and this permission notice appear in supporting documentation.
|
|
|
|
|
Dr John Maddock makes no representations about the suitability of this software
|
|
|
|
|
for any purpose. It is provided "as is" without express or implied warranty.</I></P>
|
|
|
|
|
</body>
|
|
|
|
|
</html>
|
|
|
|
|
|
|
|
|
|
|