diff --git a/syntax.htm b/syntax.htm index 45b260eb..a3af73ea 100644 --- a/syntax.htm +++ b/syntax.htm @@ -1,28 +1,28 @@ - - - - -
- - - -
-
- |
- Regex++, Regular Expression Syntax.-Copyright (c) 1998-2000 + + + + +
|
-
Literals -
-All characters are literals except: ".", "*", -"?", "+", "(", ")", "{", -"}", "[", "]", "^" and "$". -These characters are literals when preceded by a "\". A literal is a +
This section covers the regular expression syntax used by this +library, this is a programmers guide, the actual syntax presented +to your program's users will depend upon the flags used during +expression compilation.
+ +Literals
+ +All characters are literals except: ".",
+"*", "?", "+", "(",
+")", "{", "}", "[",
+"]", "^" and "$". These characters
+are literals when preceded by a "\". A literal is a
character that matches itself, or matches the result of
-traits_type::translate(), where traits_type is the traits template parameter to
-class reg_expression.
-
-
Wildcard
-The dot character "." matches any single character except : when
-match_not_dot_null is passed to the matching algorithms, the dot does
-not match a null character; when match_not_dot_newline is passed to the
-matching algorithms, then the dot does not match a newline character.
-
-
Repeats
-A repeat is an expression that is repeated an arbitrary number of times. An -expression followed by "*" can be repeated any number of times -including zero. An expression followed by "+" can be repeated any -number of times, but at least once, if the expression is compiled with the flag -regbase::bk_plus_qm then "+" is an ordinary character and -"\+" represents a repeat of once or more. An expression followed by -"?" may be repeated zero or one times only, if the expression is -compiled with the flag regbase::bk_plus_qm then "?" is an ordinary -character and "\?" represents the repeat zero or once operator. When -it is necessary to specify the minimum and maximum number of repeats -explicitly, the bounds operator "{}" may be used, thus -"a{2}" is the letter "a" repeated exactly twice, -"a{2,4}" represents the letter "a" repeated between 2 and 4 -times, and "a{2,}" represents the letter "a" repeated at -least twice with no upper limit. Note that there must be no white-space inside -the {}, and there is no upper limit on the values of the lower and upper -bounds. When the expression is compiled with the flag regbase::bk_braces then -"{" and "}" are ordinary characters and "\{" and -"\}" are used to delimit bounds instead. All repeat expressions refer -to the shortest possible previous sub-expression: a single character; a -character set, or a sub-expression grouped with "()" for example. -
-Examples:
-"ba*" will match all of "b", "ba", -"baaa" etc.
-"ba+" will match "ba" or "baaaa" for example -but not "b".
-"ba?" will match "b" or "ba".
-"ba{2,4}" will match "baa", "baaa" and -"baaaa".
-Non-greedy repeats
-Whenever the "extended" regular expression syntax is in use (the -default) then non-greedy repeats are possible by appending a '?' after the -repeat; a non-greedy repeat is one which will match the shortest -possible string.
-For example to match html tag pairs one could use something like:
-"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>"
-In this case $1 will contain the text between the tag pairs, and will be the
-shortest possible matching string.
-
-
Parenthesis
-Parentheses serve two purposes, to group items together into a -sub-expression, and to mark what generated the match. For example the -expression "(ab)*" would match all of the string "ababab". -The matching algorithms regex_match and -regex_search each take an -instance of match_results that -reports what caused the match, on exit from these functions the -match_results contains information -both on what the whole expression matched and on what each sub-expression -matched. In the example above match_results[1] would contain a pair of iterators -denoting the final "ab" of the matching string. It is permissible for -sub-expressions to match null strings. If a sub-expression takes no part in a -match - for example if it is part of an alternative that is not taken - then -both of the iterators that are returned for that sub-expression point to the -end of the input string, and the matched parameter for that -sub-expression is false. Sub-expressions are indexed from left to right -starting from 1, sub-expression 0 is the whole expression.
-Non-Marking Parenthesis
-Sometimes you need to group sub-expressions with parenthesis, but don't want -the parenthesis to spit out another marked sub-expression, in this case a -non-marking parenthesis (?:expression) can be used. For example the following -expression creates no sub-expressions:
-"(?:abc)*"
-
-
Alternatives
-Alternatives occur when the expression can match either one sub-expression -or another, each alternative is separated by a "|", or a -"\|" if the flag regbase::bk_vbar is set, or by a newline character -if the flag regbase::newline_alt is set. Each alternative is the largest -possible previous sub-expression; this is the opposite behaviour from -repetition operators.
-Examples:
-"a(b|c)" could match "ab" or "ac".
-"abc|def" could match "abc" or "def".
-
-
Sets
-A set is a set of characters that can match any single character that is a -member of the set. Sets are delimited by "[" and "]" and -can contain literals, character ranges, character classes, collating elements -and equivalence classes. Set declarations that start with "^" contain -the compliment of the elements that follow.
-Examples:
-Character literals:
-"[abc]" will match either of "a", "b", or -"c".
-"[^abc] will match any character other than "a", -"b", or "c".
-Character ranges:
-"[a-z]" will match any character in the range "a" to -"z".
-"[^A-Z]" will match any character other than those in the range -"A" to "Z".
-Note that character ranges are highly locale dependent: they match any -character that collates between the endpoints of the range, ranges will only -behave according to ASCII rules when the default "C" locale is in -effect. For example if the library is compiled with the Win32 localization -model, then [a-z] will match the ASCII characters a-z, and also 'A', 'B' etc, -but not 'Z' which collates just after 'z'. This locale specific behaviour can -be disabled by specifying regbase::nocollate when compiling, this is the -default behaviour when using regbase::normal, and forces ranges to collate -according to ASCII character code. Likewise, if you use the POSIX C API -functions then setting REG_NOCOLLATE turns off locale dependent collation.
-Character classes are denoted using the syntax "[:classname:]"
-within a set declaration, for example "[[:space:]]" is the set of all
-whitespace characters. Character classes are only available if the flag
-regbase::char_classes is set. The available character classes are:
-
- | alnum | -Any alpha numeric character. | -- |
- | alpha | -Any alphabetical character a-z and A-Z. Other -characters may also be included depending upon the locale. | -- |
- | blank | -Any blank character, either a space or a tab. | -- |
- | cntrl | -Any control character. | -- |
- | digit | -Any digit 0-9. | -- |
- | graph | -Any graphical character. | -- |
- | lower | -Any lower case character a-z. Other characters may -also be included depending upon the locale. | -- |
- | Any printable character. | -- | |
- | punct | -Any punctuation character. | -- |
- | space | -Any whitespace character. | -- |
- | upper | -Any upper case character A-Z. Other characters may -also be included depending upon the locale. | -- |
- | xdigit | -Any hexadecimal digit character, 0-9, a-f and -A-F. | -- |
- | word | -Any word character - all alphanumeric characters -plus the underscore. | -- |
- | unicode | -Any character whose code is greater than 255, this -applies to the wide character traits classes only. | -- |
There are some shortcuts that can be used in place of the character classes, -provided the flag regbase::escape_in_lists is set then you can use:
-\w in place of [:word:]
-\s in place of [:space:]
-\d in place of [:digit:]
-\l in place of [:lower:]
-\u in place of [:upper:]
-
-
Collating elements take the general form [.tagname.] inside a set
-declaration, where tagname is either a single character, or a name of a
-collating element, for example [[.a.]] is equivalent to [a], and [[.comma.]] is
-equivalent to [,]. The library supports all the standard POSIX collating
-element names, and in addition the following digraphs: "ae",
-"ch", "ll", "ss", "nj", "dz",
-"lj", each in lower, upper and title case variations. Multi-character
-collating elements can result in the set matching more than one character, for
-example [[.ae.]] would match two characters, but note that [^[.ae.]] would only
-match one character.
-
-
Equivalence classes take the general form [=tagname=] inside a set
-declaration, where tagname is either a single character, or a name of a
-collating element, and matches any character that is a member of the same
-primary equivalence class as the collating element [.tagname.]. An equivalence
-class is a set of characters that collate the same, a primary equivalence class
-is a set of characters whose primary sort key are all the same (for example
-strings are typically collated by character, then by accent, and then by case;
-the primary sort key then relates to the character, the secondary to the
-accentation, and the tertiary to the case). If there is no equivalence class
-corresponding to tagname, then [=tagname=] is exactly the same as
-[.tagname.]. Unfortunately there is no locale independent method of obtaining
-the primary sort key for a character, except under Win32. For other operating
-systems the library will "guess" the primary sort key from the full
-sort key (obtained from strxfrm), so equivalence classes are probably
-best considered broken under any operating system other than Win32.
-
-
To include a literal "-" in a set declaration then: make it the
-first character after the opening "[" or "[^", the endpoint
-of a range, a collating element, or if the flag regbase::escape_in_lists is set
-then precede with an escape character as in "[\-]". To include a
-literal "[" or "]" or "^" in a set then make them
-the endpoint of a range, a collating element, or precede with an escape
-character if the flag regbase::escape_in_lists is set.
-
-
Line anchors
-An anchor is something that matches the null string at the start or end of a
-line: "^" matches the null string at the start of a line,
-"$" matches the null string at the end of a line.
-
-
Back references
-A back reference is a reference to a previous sub-expression that has
-already been matched, the reference is to what the sub-expression matched, not
-to the expression itself. A back reference consists of the escape character
-"\" followed by a digit "1" to "9",
-"\1" refers to the first sub-expression, "\2" to the second
-etc. For example the expression "(.*)\1" matches any string that is
-repeated about its mid-point for example "abcabc" or
-"xyzxyz". A back reference to a sub-expression that did not
-participate in any match, matches the null string: NB this is different to some
-other regular expression matchers. Back references are only available if the
-expression is compiled with the flag regbase::bk_refs set.
-
-
Characters by code
-This is an extension to the algorithm that is not available in other
-libraries, it consists of the escape character followed by the digit
-"0" followed by the octal character code. For example
-"\023" represents the character whose octal code is 23. Where
-ambiguity could occur use parentheses to break the expression up:
-"\0103" represents the character whose code is 103, "(\010)3
-represents the character 10 followed by "3". To match characters by
-their hexadecimal code, use \x followed by a string of hexadecimal digits,
-optionally enclosed inside {}, for example \xf0 or \x{aff}, notice the latter
-example is a Unicode character.
-
-
Word operators
-The following operators are provided for compatibility with the GNU regular -expression library.
-"\w" matches any single character that is a member of the -"word" character class, this is identical to the expression -"[[:word:]]".
-"\W" matches any single character that is not a member of the -"word" character class, this is identical to the expression -"[^[:word:]]".
-"\<" matches the null string at the start of a word.
-"\>" matches the null string at the end of the word.
-"\b" matches the null string at either the start or the end of a -word.
-"\B" matches a null string within a word.
-The start of the sequence passed to the matching algorithms is considered to
-be a potential start of a word unless the flag match_not_bow is set. The end of
-the sequence passed to the matching algorithms is considered to be a potential
-end of a word unless the flag match_not_eow is set.
-
-
Buffer operators
-The following operators are provide for compatibility with the GNU regular -expression library, and Perl regular expressions:
-"\`" matches the start of a buffer.
-"\A" matches the start of the buffer.
-"\'" matches the end of a buffer.
-"\z" matches the end of a buffer.
-"\Z" matches the end of a buffer, or possibly one or more new line -characters followed by the end of the buffer.
-A buffer is considered to consist of the whole sequence passed to the
-matching algorithms, unless the flags match_not_bob or match_not_eob are set.
-
-
-
Escape operator
-The escape character "\" has several meanings.
-Inside a set declaration the escape character is a normal character unless -the flag regbase::escape_in_lists is set in which case whatever follows the -escape is a literal character regardless of its normal meaning.
-The escape operator may introduce an operator for example: back references, -or a word operator.
-The escape operator may make the following character normal, for example
-"\*" represents a literal "*" rather than the repeat
-operator.
-
-
Single character escape sequences
-The following escape sequences are aliases for single characters:
-
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
- | |
- |
- |
-- |
Miscellaneous escape sequences:
-The following are provided mostly for perl compatibility, but note that
-there are some differences in the meanings of \l \L \u and \U:
-
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
- | |
- |
-- |
What gets matched?
-The regular expression library will match the first possible matching
-string, if more than one string starting at a given location can match then it
-matches the longest possible string, unless the flag match_any is set, in which
-case the first match encountered is returned. Use of the match_any option can
-reduce the time taken to find the match - but is only useful if the user is
-less concerned about what matched - for example it would not be suitable for
-search and replace operations. In cases where their are multiple possible
-matches all starting at the same location, and all of the same length, then the
-match chosen is the one with the longest first sub-expression, if that is the
-same for two or more matches, then the second sub-expression will be examined
-and so on.
-
Copyright Dr John -Maddock 1998-2000 all rights reserved.
- - +traits_type::translate(), where traits_type is the traits +template parameter to class reg_expression.Wildcard
+ +The dot character "." matches any single character
+except : when match_not_dot_null is passed to the matching
+algorithms, the dot does not match a null character; when match_not_dot_newline
+is passed to the matching algorithms, then the dot does not match
+a newline character.
+
+
Repeats
+ +A repeat is an expression that is repeated an arbitrary number +of times. An expression followed by "*" can be repeated +any number of times including zero. An expression followed by +"+" can be repeated any number of times, but at least +once, if the expression is compiled with the flag regbase::bk_plus_qm +then "+" is an ordinary character and "\+" +represents a repeat of once or more. An expression followed by +"?" may be repeated zero or one times only, if the +expression is compiled with the flag regbase::bk_plus_qm then +"?" is an ordinary character and "\?" +represents the repeat zero or once operator. When it is necessary +to specify the minimum and maximum number of repeats explicitly, +the bounds operator "{}" may be used, thus "a{2}" +is the letter "a" repeated exactly twice, "a{2,4}" +represents the letter "a" repeated between 2 and 4 +times, and "a{2,}" represents the letter "a" +repeated at least twice with no upper limit. Note that there must +be no white-space inside the {}, and there is no upper limit on +the values of the lower and upper bounds. When the expression is +compiled with the flag regbase::bk_braces then "{" and +"}" are ordinary characters and "\{" and +"\}" are used to delimit bounds instead. All repeat +expressions refer to the shortest possible previous sub-expression: +a single character; a character set, or a sub-expression grouped +with "()" for example.
+ +Examples:
+ +"ba*" will match all of "b", "ba", +"baaa" etc.
+ +"ba+" will match "ba" or "baaaa" +for example but not "b".
+ +"ba?" will match "b" or "ba".
+ +"ba{2,4}" will match "baa", "baaa" +and "baaaa".
+ +Non-greedy repeats
+ +Whenever the "extended" regular expression syntax is +in use (the default) then non-greedy repeats are possible by +appending a '?' after the repeat; a non-greedy repeat is one +which will match the shortest possible string.
+ +For example to match html tag pairs one could use something +like:
+ +"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>" +
+ +In this case $1 will contain the text between the tag pairs,
+and will be the shortest possible matching string.
+
+
Parenthesis
+ +Parentheses serve two purposes, to group items together into a +sub-expression, and to mark what generated the match. For example +the expression "(ab)*" would match all of the string +"ababab". The matching algorithms regex_match and regex_search each +take an instance of match_results +that reports what caused the match, on exit from these functions +the match_results +contains information both on what the whole expression matched +and on what each sub-expression matched. In the example above +match_results[1] would contain a pair of iterators denoting the +final "ab" of the matching string. It is permissible +for sub-expressions to match null strings. If a sub-expression +takes no part in a match - for example if it is part of an +alternative that is not taken - then both of the iterators that +are returned for that sub-expression point to the end of the +input string, and the matched parameter for that sub-expression +is false. Sub-expressions are indexed from left to right +starting from 1, sub-expression 0 is the whole expression.
+ +Non-Marking Parenthesis
+ +Sometimes you need to group sub-expressions with parenthesis, +but don't want the parenthesis to spit out another marked sub-expression, +in this case a non-marking parenthesis (?:expression) can be used. +For example the following expression creates no sub-expressions:
+ +"(?:abc)*"
+
+
Alternatives
+ +Alternatives occur when the expression can match either one +sub-expression or another, each alternative is separated by a +"|", or a "\|" if the flag regbase::bk_vbar +is set, or by a newline character if the flag regbase::newline_alt +is set. Each alternative is the largest possible previous sub-expression; +this is the opposite behaviour from repetition operators.
+ +Examples:
+ +"a(b|c)" could match "ab" or "ac". +
+ +"abc|def" could match "abc" or "def".
+
+
+
Sets
+ +A set is a set of characters that can match any single +character that is a member of the set. Sets are delimited by +"[" and "]" and can contain literals, +character ranges, character classes, collating elements and +equivalence classes. Set declarations that start with "^" +contain the compliment of the elements that follow.
+ +Examples:
+ +Character literals:
+ +"[abc]" will match either of "a", "b", +or "c".
+ +"[^abc] will match any character other than "a", +"b", or "c".
+ +Character ranges:
+ +"[a-z]" will match any character in the range "a" +to "z".
+ +"[^A-Z]" will match any character other than those +in the range "A" to "Z".
+ +Note that character ranges are highly locale dependent: they +match any character that collates between the endpoints of the +range, ranges will only behave according to ASCII rules when the +default "C" locale is in effect. For example if the +library is compiled with the Win32 localization model, then [a-z] +will match the ASCII characters a-z, and also 'A', 'B' etc, but +not 'Z' which collates just after 'z'. This locale specific +behaviour can be disabled by specifying regbase::nocollate when +compiling, this is the default behaviour when using regbase::normal, +and forces ranges to collate according to ASCII character code. +Likewise, if you use the POSIX C API functions then setting +REG_NOCOLLATE turns off locale dependent collation.
+ +Character classes are denoted using the syntax "[:classname:]"
+within a set declaration, for example "[[:space:]]" is
+the set of all whitespace characters. Character classes are only
+available if the flag regbase::char_classes is set. The available
+character classes are:
+
+ | alnum | +Any alpha numeric character. | ++ |
+ | alpha | +Any alphabetical character a-z + and A-Z. Other characters may also be included depending + upon the locale. | ++ |
+ | blank | +Any blank character, either + a space or a tab. | ++ |
+ | cntrl | +Any control character. | ++ |
+ | digit | +Any digit 0-9. | ++ |
+ | graph | +Any graphical character. | ++ |
+ | lower | +Any lower case character a-z. + Other characters may also be included depending upon the + locale. | ++ |
+ | Any printable character. | ++ | |
+ | punct | +Any punctuation character. | ++ |
+ | space | +Any whitespace character. | ++ |
+ | upper | +Any upper case character A-Z. + Other characters may also be included depending upon the + locale. | ++ |
+ | xdigit | +Any hexadecimal digit + character, 0-9, a-f and A-F. | ++ |
+ | word | +Any word character - all + alphanumeric characters plus the underscore. | ++ |
+ | unicode | +Any character whose code is + greater than 255, this applies to the wide character + traits classes only. | ++ |
There are some shortcuts that can be used in place of the +character classes, provided the flag regbase::escape_in_lists is +set then you can use:
+ +\w in place of [:word:]
+ +\s in place of [:space:]
+ +\d in place of [:digit:]
+ +\l in place of [:lower:]
+ +\u in place of [:upper:]
+
+
Collating elements take the general form [.tagname.] inside a
+set declaration, where tagname is either a single
+character, or a name of a collating element, for example [[.a.]]
+is equivalent to [a], and [[.comma.]] is equivalent to [,]. The
+library supports all the standard POSIX collating element names,
+and in addition the following digraphs: "ae", "ch",
+"ll", "ss", "nj", "dz",
+"lj", each in lower, upper and title case variations.
+Multi-character collating elements can result in the set matching
+more than one character, for example [[.ae.]] would match two
+characters, but note that [^[.ae.]] would only match one
+character.
+
+
Equivalence classes take the general form [=tagname=] inside a
+set declaration, where tagname is either a single
+character, or a name of a collating element, and matches any
+character that is a member of the same primary equivalence class
+as the collating element [.tagname.]. An equivalence class is a
+set of characters that collate the same, a primary equivalence
+class is a set of characters whose primary sort key are all the
+same (for example strings are typically collated by character,
+then by accent, and then by case; the primary sort key then
+relates to the character, the secondary to the accentation, and
+the tertiary to the case). If there is no equivalence class
+corresponding to tagname, then [=tagname=] is exactly the
+same as [.tagname.]. Unfortunately there is no locale independent
+method of obtaining the primary sort key for a character, except
+under Win32. For other operating systems the library will "guess"
+the primary sort key from the full sort key (obtained from strxfrm),
+so equivalence classes are probably best considered broken under
+any operating system other than Win32.
+
+
To include a literal "-" in a set declaration then:
+make it the first character after the opening "[" or
+"[^", the endpoint of a range, a collating element, or
+if the flag regbase::escape_in_lists is set then precede with an
+escape character as in "[\-]". To include a literal
+"[" or "]" or "^" in a set then
+make them the endpoint of a range, a collating element, or
+precede with an escape character if the flag regbase::escape_in_lists
+is set.
+
+
Line anchors
+ +An anchor is something that matches the null string at the
+start or end of a line: "^" matches the null string at
+the start of a line, "$" matches the null string at the
+end of a line.
+
+
Back references
+ +A back reference is a reference to a previous sub-expression
+that has already been matched, the reference is to what the sub-expression
+matched, not to the expression itself. A back reference consists
+of the escape character "\" followed by a digit "1"
+to "9", "\1" refers to the first sub-expression,
+"\2" to the second etc. For example the expression
+"(.*)\1" matches any string that is repeated about its
+mid-point for example "abcabc" or "xyzxyz". A
+back reference to a sub-expression that did not participate in
+any match, matches the null string: NB this is different to some
+other regular expression matchers. Back references are only
+available if the expression is compiled with the flag regbase::bk_refs
+set.
+
+
Characters by code
+ +This is an extension to the algorithm that is not available in
+other libraries, it consists of the escape character followed by
+the digit "0" followed by the octal character code. For
+example "\023" represents the character whose octal
+code is 23. Where ambiguity could occur use parentheses to break
+the expression up: "\0103" represents the character
+whose code is 103, "(\010)3 represents the character 10
+followed by "3". To match characters by their
+hexadecimal code, use \x followed by a string of hexadecimal
+digits, optionally enclosed inside {}, for example \xf0 or
+\x{aff}, notice the latter example is a Unicode character.
+
+
Word operators
+ +The following operators are provided for compatibility with +the GNU regular expression library.
+ +"\w" matches any single character that is a member +of the "word" character class, this is identical to the +expression "[[:word:]]".
+ +"\W" matches any single character that is not a +member of the "word" character class, this is identical +to the expression "[^[:word:]]".
+ +"\<" matches the null string at the start of a +word.
+ +"\>" matches the null string at the end of the +word.
+ +"\b" matches the null string at either the start or +the end of a word.
+ +"\B" matches a null string within a word.
+ +The start of the sequence passed to the matching algorithms is
+considered to be a potential start of a word unless the flag
+match_not_bow is set. The end of the sequence passed to the
+matching algorithms is considered to be a potential end of a word
+unless the flag match_not_eow is set.
+
+
Buffer operators
+ +The following operators are provide for compatibility with the +GNU regular expression library, and Perl regular expressions:
+ +"\`" matches the start of a buffer.
+ +"\A" matches the start of the buffer.
+ +"\'" matches the end of a buffer.
+ +"\z" matches the end of a buffer.
+ +"\Z" matches the end of a buffer, or possibly one or +more new line characters followed by the end of the buffer.
+ +A buffer is considered to consist of the whole sequence passed
+to the matching algorithms, unless the flags match_not_bob or
+match_not_eob are set.
+
+
Escape operator
+ +The escape character "\" has several meanings.
+ +Inside a set declaration the escape character is a normal +character unless the flag regbase::escape_in_lists is set in +which case whatever follows the escape is a literal character +regardless of its normal meaning.
+ +The escape operator may introduce an operator for example: +back references, or a word operator.
+ +The escape operator may make the following character normal,
+for example "\*" represents a literal "*"
+rather than the repeat operator.
+
+
Single character escape sequences
+ +The following escape sequences are aliases for single
+characters:
+
+ | Escape sequence | +Character code | +Meaning | ++ |
+ | \a | +0x07 | +Bell character. | ++ |
+ | \f | +0x0C | +Form feed. | ++ |
+ | \n | +0x0A | +Newline character. | ++ |
+ | \r | +0x0D | +Carriage return. | ++ |
+ | \t | +0x09 | +Tab character. | ++ |
+ | \v | +0x0B | +Vertical tab. | ++ |
+ | \e | +0x1B | +ASCII Escape character. | ++ |
+ | \0dd | +0dd | +An octal character code, + where dd is one or more octal digits. | ++ |
+ | \xXX | +0xXX | +A hexadecimal character + code, where XX is one or more hexadecimal digits. | ++ |
+ | \x{XX} | +0xXX | +A hexadecimal character + code, where XX is one or more hexadecimal digits, + optionally a unicode character. | ++ |
+ | \cZ | +z-@ | +An ASCII escape sequence + control-Z, where Z is any ASCII character greater than or + equal to the character code for '@'. | ++ |
+
Miscellaneous escape sequences:
+ +The following are provided mostly for perl compatibility, but
+note that there are some differences in the meanings of \l \L \u
+and \U:
+
+ | \w | +Equivalent to [[:word:]]. | ++ |
+ | \W | +Equivalent to [^[:word:]]. | ++ |
+ | \s | +Equivalent to [[:space:]]. | ++ |
+ | \S | +Equivalent to [^[:space:]]. | ++ |
+ | \d | +Equivalent to [[:digit:]]. | ++ |
+ | \D | +Equivalent to [^[:digit:]]. | ++ |
+ | \l | +Equivalent to [[:lower:]]. | ++ |
+ | \L | +Equivalent to [^[:lower:]]. | ++ |
+ | \u | +Equivalent to [[:upper:]]. | ++ |
+ | \U | +Equivalent to [^[:upper:]]. | ++ |
+ | \C | +Any single character, + equivalent to '.'. | ++ |
+ | \X | +Match any Unicode combining + character sequence, for example "a\x 0301" (a + letter a with an acute). | ++ |
+ | \Q | +The begin quote operator, + everything that follows is treated as a literal character + until a \E end quote operator is found. | ++ |
+ | \E | +The end quote operator, + terminates a sequence begun with \Q. | ++ |
+
What gets matched?
+ +The regular expression library will match the first possible
+matching string, if more than one string starting at a given
+location can match then it matches the longest possible string,
+unless the flag match_any is set, in which case the first match
+encountered is returned. Use of the match_any option can reduce
+the time taken to find the match - but is only useful if the user
+is less concerned about what matched - for example it would not
+be suitable for search and replace operations. In cases where
+their are multiple possible matches all starting at the same
+location, and all of the same length, then the match chosen is
+the one with the longest first sub-expression, if that is the
+same for two or more matches, then the second sub-expression will
+be examined and so on.
+
Copyright Dr +John Maddock 1998-2000 all rights reserved.
+ +