<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Boost.Regex: Working With Unicode and ICU String Types</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<LINK href="../../../boost.css" type="text/css" rel="stylesheet"></head>
<body>
<P>
<TABLE id="Table1" cellSpacing="1" cellPadding="1" width="100%" border="0">
<TR>
<td vAlign="top" width="300">
<h3><A href="../../../index.htm"><IMG height="86" alt="C++ Boost" src="../../../boost.png" width="277" border="0"></A></h3>
</td>
<TD width="353">
<H1 align="center">Boost.Regex</H1>
<H2 align="center">Working With Unicode and ICU String Types.</H2>
</TD>
<td width="50">
<h3><A href="index.html"><IMG height="45" alt="Boost.Regex Index" src="uarrow.gif" width="43" border="0"></A></h3>
</td>
</TR>
</TABLE>
</P>
<HR>
<p></p>
<H3>Contents</H3>
<dl class="index">
<dt><a href="#introduction">Introduction</a></dt>
<dt><a href="#types">Unicode regular expression types</a></dt>
<dt><a href="#algo">Regular Expression Algorithms</a>
<dd>
<dl class="index">
<dt><a href="#u32regex_match">u32regex_match</a></dt>
<dt><a href="#u32regex_search">u32regex_search</a></dt>
<dt><a href="#u32regex_replace">u32regex_replace</a></dt>
</dl>
</dd>
</dt>
<dt><a href="#iterators">Iterators</a>
<dd>
<dl class="index">
<dt><a href="#u32regex_iterator">u32regex_iterator</a></dt>
<dt><a href="#u32regex_token_iterator">u32regex_token_iterator</a></dt>
</dl>
</dd>
</dt>
</dl>
<H3><A name="introduction"></A>Introduction</H3>
<P>The header:</P>
<PRE>&lt;boost/regex/icu.hpp&gt;</PRE>
<P>contains the data types and algorithms necessary for working with regular
expressions in a Unicode aware environment.&nbsp;
</P>
<P>In order to use this header you will need <A href="http://www.ibm.com/software/globalization/icu/">
the ICU library</A>, and you will need to have built the Boost.Regex library
with <A href="install.html#unicode">ICU support enabled</A>.</P>
<P>The header will enable you to:</P>
<UL>
<LI>
Create regular expressions that treat Unicode strings as sequences of UTF-32
code points.
<LI>
Create regular expressions that support various Unicode data properties,
including character classification.
<LI>
Transparently search Unicode strings that are encoded as either UTF-8, UTF-16
or UTF-32.</LI></UL>
<H3><A name="types"></A>Unicode regular expression types</H3>
<P>Header &lt;boost/regex/icu.hpp&gt; provides a regular expression&nbsp;traits
class that handles UTF-32 characters:</P>
<PRE>class icu_regex_traits;</PRE>
<P>and a regular expression type based upon that:</P>
<PRE>typedef basic_regex&lt;UChar32,icu_regex_traits&gt; u32regex;</PRE>
<P>The type <EM>u32regex</EM> is the regular expression type to use for all Unicode
regular expressions; internally it uses UTF-32 code points, but it can be created
from, and used to search, UTF-8 or UTF-16 encoded strings as well as
UTF-32 ones.</P>
<P>The <A href="basic_regex.html#c2">constructors</A> and <A href="basic_regex.html#a1">
assign</A> member functions of u32regex require UTF-32 encoded strings, but
there are a series of overloaded algorithms called make_u32regex which allow
regular expressions to be created from UTF-8, UTF-16, or UTF-32 encoded
strings:</P>
<PRE>template &lt;class InputIterator&gt;
u32regex make_u32regex(InputIterator i, InputIterator j, boost::regex_constants::syntax_option_type opt);
</PRE>
<P><STRONG>Effects:</STRONG> Creates a regular expression object from the iterator
sequence [i,j). The character encoding of the sequence is determined based upon <code>
sizeof(*i)</code>: 1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.</P>
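<P>As a rough illustration of the size-based rule above, the following sketch maps
an iterator's value type to an encoding. The names <code>encoding</code> and <code>encoding_of</code>
are hypothetical and are not part of Boost.Regex; the library's own dispatch may
well be implemented differently.</P>

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical names for illustration only; not part of Boost.Regex.
enum class encoding { utf8, utf16, utf32 };

// Pick an encoding from the width of the iterator's value type,
// mirroring the rule "1 implies UTF-8, 2 implies UTF-16, 4 implies UTF-32".
template <class InputIterator>
constexpr encoding encoding_of()
{
    using char_type = typename std::iterator_traits<InputIterator>::value_type;
    static_assert(sizeof(char_type) == 1 || sizeof(char_type) == 2 || sizeof(char_type) == 4,
                  "unsupported code unit width");
    return sizeof(char_type) == 1 ? encoding::utf8
         : sizeof(char_type) == 2 ? encoding::utf16
                                  : encoding::utf32;
}
```

<P>With this sketch, <code>encoding_of&lt;const char*&gt;()</code> yields <code>encoding::utf8</code>,
while an iterator over a 2-byte or 4-byte code unit type selects UTF-16 or UTF-32
respectively.</P>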
<PRE>u32regex make_u32regex(const char* p, boost::regex_constants::syntax_option_type opt = boost::regex_constants::perl);
</PRE>
<P><STRONG>Effects:</STRONG> Creates a regular expression object from the
null-terminated UTF-8 character sequence <EM>p</EM>.</P>
<PRE>u32regex make_u32regex(const unsigned char* p, boost::regex_constants::syntax_option_type opt = boost::regex_constants::perl);</PRE>
<P><STRONG>Effects:</STRONG> Creates a regular expression object from the
null-terminated UTF-8 character sequence <EM>p</EM>.</P>
<PRE>u32regex make_u32regex(const wchar_t* p, boost::regex_constants::syntax_option_type opt = boost::regex_constants::perl);</PRE>
<P><STRONG>Effects:</STRONG> Creates a regular expression object from the
null-terminated character sequence <EM>p</EM>.&nbsp; The character encoding of
the sequence is determined based upon <CODE>sizeof(wchar_t)</CODE>: 1 implies
UTF-8, 2 implies UTF-16, and 4 implies UTF-32.</P>
<PRE>u32regex make_u32regex(const UChar* p, boost::regex_constants::syntax_option_type opt = boost::regex_constants::perl);</PRE>
<P><STRONG>Effects:</STRONG> Creates a regular expression object from the
null-terminated UTF-16 character sequence <EM>p</EM>.</P>
<PRE>template&lt;class C, class T, class A&gt;
u32regex make_u32regex(const std::basic_string&lt;C, T, A&gt;&amp; s, boost::regex_constants::syntax_option_type opt = boost::regex_constants::perl);</PRE>
<P><STRONG>Effects:</STRONG> Creates a regular expression object from the string <EM>s</EM>.&nbsp;
The character encoding of the string is determined based upon <CODE>sizeof(C)</CODE>:
1 implies UTF-8, 2 implies UTF-16, and 4 implies UTF-32.</P>
<PRE>u32regex make_u32regex(const UnicodeString&amp; s, boost::regex_constants::syntax_option_type opt = boost::regex_constants::perl);</PRE>
<P><STRONG>Effects:</STRONG> Creates a regular expression object from the UTF-16
encoded string <EM>s</EM>.</P>
<H3><A name="algo"></A>Regular Expression Algorithms</H3>
<P>The regular expression algorithms <A href="regex_match.html">regex_match</A>, <A href="regex_search.html">
regex_search</A> and <A href="regex_replace.html">regex_replace</A> all
expect the character sequence upon which they operate to be encoded in the
same character encoding as the regular expression object with which they are
used.&nbsp; For Unicode regular expressions that behavior is undesirable: while
we may want to process the data in UTF-32 "chunks", the actual data is much
more likely to be encoded as either UTF-8 or UTF-16.&nbsp; Therefore the header
&lt;boost/regex/icu.hpp&gt; provides a series of thin wrappers around these
algorithms, called u32regex_match, u32regex_search, and u32regex_replace.&nbsp;
These wrappers use iterator adapters internally to make external UTF-8 or
UTF-16 data look as though it is really a UTF-32 sequence, which can then be
passed on to the "real" algorithm.</P>
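<P>To give a feel for what such an iterator adapter does, here is a minimal sketch
of a forward-only UTF-8 to UTF-32 adapter. It is an illustration under stated
assumptions, not the adapter that the library actually uses: the class <code>utf8_decoder</code>
and the helper <code>to_code_points</code> are made up for this example, and the input is
assumed to be well-formed UTF-8.</P>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical adapter for illustration; Boost.Regex ships its own
// iterator adapters, whose names and details are not shown here.
// Assumes the input is well-formed UTF-8 (no validation is performed).
class utf8_decoder
{
public:
    explicit utf8_decoder(const char* pos) : pos_(pos) {}

    // Decode the code point that starts at the current position.
    std::uint32_t operator*() const
    {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        const int len = length(lead);
        // The lead byte contributes its low bits; a one-byte sequence keeps all 7.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3Fu);
        return cp;
    }

    // Advance by one code point, that is, by 1 to 4 bytes.
    utf8_decoder& operator++()
    {
        pos_ += length(static_cast<unsigned char>(*pos_));
        return *this;
    }

    bool operator!=(const utf8_decoder& other) const { return pos_ != other.pos_; }

private:
    // Number of bytes in the sequence introduced by this lead byte.
    static int length(unsigned char lead)
    {
        if (lead < 0x80) return 1;         // 0xxxxxxx
        if ((lead >> 5) == 0x6) return 2;  // 110xxxxx
        if ((lead >> 4) == 0xE) return 3;  // 1110xxxx
        return 4;                          // 11110xxx
    }

    const char* pos_;
};

// Collect the code points of a UTF-8 string by walking it through the adapter.
inline std::vector<std::uint32_t> to_code_points(const std::string& utf8)
{
    std::vector<std::uint32_t> out;
    utf8_decoder i(utf8.data()), j(utf8.data() + utf8.size());
    for (; i != j; ++i)
        out.push_back(*i);
    return out;
}
```

<P>An algorithm written against UTF-32 code points can then consume the adapter
without knowing that the underlying storage is UTF-8; for instance, the two-byte
sequence 0xC2 0xA3 is seen as the single code point U+00A3.</P>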
<H4><A name="u32regex_match"></A>u32regex_match</H4>
<P>For each <A href="regex_match.html">regex_match</A> algorithm defined by
&lt;boost/regex.hpp&gt;, &lt;boost/regex/icu.hpp&gt; defines an overloaded
algorithm that takes the same arguments, but which is called <EM>u32regex_match</EM>,
and which will accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an
ICU&nbsp;UnicodeString, as input.</P>
<P><STRONG>Example: </STRONG>match a password, encoded in a UTF-16 UnicodeString:</P>
<PRE>//
// Find out if *password* meets our password requirements,
// as defined by the regular expression *requirements*.
//
bool is_valid_password(const UnicodeString&amp; password, const UnicodeString&amp; requirements)
{
return boost::u32regex_match(password, boost::make_u32regex(requirements));
}
</PRE>
<P>
<P><STRONG>Example: </STRONG>match a UTF-8 encoded filename:</P>
<PRE>//
// Extract filename part of a path from a UTF-8 encoded std::string and return the result
// as another std::string:
//
std::string get_filename(const std::string&amp; path)
{
boost::u32regex r = boost::make_u32regex("(?:\\A|.*\\\\)([^\\\\]+)");
boost::smatch what;
if(boost::u32regex_match(path, what, r))
{
// extract $1 as a std::string:
return what.str(1);
}
else
{
throw std::runtime_error("Invalid pathname");
}
}
</PRE>
<H4><A name="u32regex_search"></A>u32regex_search</H4>
<P>For each <A href="regex_search.html">regex_search</A> algorithm defined by
&lt;boost/regex.hpp&gt;, &lt;boost/regex/icu.hpp&gt; defines an overloaded
algorithm that takes the same arguments, but which is called <EM>u32regex_search</EM>,
and which will accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an
ICU&nbsp;UnicodeString, as input.</P>
<P><STRONG>Example: </STRONG>search for a character sequence in a specific
language block:
</P>
<PRE>UnicodeString extract_greek(const UnicodeString&amp; text)
{
// searches through some UTF-16 encoded text for a block encoded in Greek,
// this expression is imperfect, but the best we can do for now - searching
// for specific scripts is actually pretty hard to do right.
//
// Here we search for a character sequence that begins with a Greek letter,
// and continues with characters that are either not-letters ( [^[:L*:]] )
// or are characters in the Greek character block ( [\\x{370}-\\x{3FF}] ).
//
boost::u32regex r = boost::make_u32regex(L"[\\x{370}-\\x{3FF}](?:[^[:L*:]]|[\\x{370}-\\x{3FF}])*");
boost::u16match what;
if(boost::u32regex_search(text, what, r))
{
// extract $0 as a UnicodeString:
return UnicodeString(what[0].first, what.length(0));
}
else
{
throw std::runtime_error("No Greek found!");
}
}</PRE>
<H4><A name="u32regex_replace"></A>u32regex_replace</H4>
<P>For each <A href="regex_replace.html">regex_replace</A> algorithm defined by
&lt;boost/regex.hpp&gt;, &lt;boost/regex/icu.hpp&gt; defines an overloaded
algorithm that takes the same arguments, but which is called <EM>u32regex_replace</EM>,
and which will accept UTF-8, UTF-16 or UTF-32 encoded data, as well as an
ICU&nbsp;UnicodeString, as input.&nbsp; The input sequence and the format string
specifier passed to the algorithm can be encoded independently (for example,
one can be in UTF-8 and the other in UTF-16), but the result string / output iterator
argument must use the same character encoding as the text being searched.</P>
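<P>Because matching works on UTF-32 code points while the output must be in the
encoding of the searched text, code points may need to be re-encoded on the way
out. The sketch below shows the arithmetic for the UTF-16 case only; <code>append_utf16</code>
is a hypothetical helper, not a Boost.Regex or ICU function, and it assumes that
it is handed a valid code point.</P>

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, not a Boost.Regex or ICU function.
// Appends one Unicode code point to a UTF-16 string.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
inline void append_utf16(std::u16string& out, std::uint32_t cp)
{
    if (cp < 0x10000u)
    {
        // Code points in the Basic Multilingual Plane fit in one code unit.
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        // Everything else is split into a high and a low surrogate.
        const std::uint32_t v = cp - 0x10000u;                         // 20 significant bits
        out.push_back(static_cast<char16_t>(0xD800u + (v >> 10)));     // top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00u + (v & 0x3FFu)));  // bottom 10 bits
    }
}
```

<P>For example, U+00A3 is appended as the single code unit 0x00A3, whereas U+1F600
becomes the surrogate pair 0xD83D 0xDE00.</P>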
<P><STRONG>Example: </STRONG>Credit card number reformatting:</P>
<PRE>//
// Take a credit card number as a string of digits,
// and reformat it as a human readable string with "-"
// separating each group of four digits;
// note that we're mixing a UTF-32 regex, with a UTF-16
// string and a UTF-8 format specifier, and it still all
// just works:
//
const boost::u32regex e = boost::make_u32regex("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
const char* human_format = "$1-$2-$3-$4";
UnicodeString human_readable_card_number(const UnicodeString&amp; s)
{
return boost::u32regex_replace(s, e, human_format);
}</PRE>
<P>
<H2><A name="iterators"></A>Iterators</H2>
<H3><A name="u32regex_iterator"></A>u32regex_iterator</H3>
<P>Type u32regex_iterator is in all respects the same as <A href="regex_iterator.html">
regex_iterator</A> except that since the regular expression type is always
u32regex it only takes one template parameter (the iterator type). It also
calls u32regex_search internally, allowing it to interface correctly with
UTF-8, UTF-16, and UTF-32 data:</P>
<PRE>
template &lt;class BidirectionalIterator&gt;
class u32regex_iterator
{
// for members see <A href="regex_iterator.html">regex_iterator</A>
};
typedef u32regex_iterator&lt;const char*&gt; utf8regex_iterator;
typedef u32regex_iterator&lt;const UChar*&gt; utf16regex_iterator;
typedef u32regex_iterator&lt;const UChar32*&gt; utf32regex_iterator;
</PRE>
<P>In order to simplify the construction of a u32regex_iterator from a string,
there are a series of non-member helper functions called
make_u32regex_iterator:</P>
<PRE>
u32regex_iterator&lt;const char*&gt;
make_u32regex_iterator(const char* s,
const u32regex&amp; e,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_iterator&lt;const wchar_t*&gt;
make_u32regex_iterator(const wchar_t* s,
const u32regex&amp; e,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_iterator&lt;const UChar*&gt;
make_u32regex_iterator(const UChar* s,
const u32regex&amp; e,
regex_constants::match_flag_type m = regex_constants::match_default);
template &lt;class charT, class Traits, class Alloc&gt;
u32regex_iterator&lt;typename std::basic_string&lt;charT, Traits, Alloc&gt;::const_iterator&gt;
make_u32regex_iterator(const std::basic_string&lt;charT, Traits, Alloc&gt;&amp; s,
const u32regex&amp; e,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_iterator&lt;const UChar*&gt;
make_u32regex_iterator(const UnicodeString&amp; s,
const u32regex&amp; e,
regex_constants::match_flag_type m = regex_constants::match_default);</PRE>
<P>
<P>Each of these overloads returns an iterator that enumerates all occurrences of
expression <EM>e</EM>, in text <EM>s</EM>, using match_flags <EM>m.</EM></P>
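<P>The iteration pattern is the usual one for a regex iterator: construct a begin
iterator, compare it against a default-constructed end iterator, and increment.
As a sketch that compiles without ICU, the following uses the standard library's <code>std::sregex_iterator</code>
as a stand-in. It is an analogue only: it handles plain <code>char</code> data rather
than Unicode, and <code>all_matches</code> is a name invented for this example.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Standard-library stand-in for the iteration pattern; this is NOT the
// u32regex_iterator API and it only understands plain char data.
// all_matches is a hypothetical helper name.
inline std::vector<std::string> all_matches(const std::string& text, const std::regex& e)
{
    std::vector<std::string> out;
    // A default-constructed iterator (j) acts as the end-of-sequence marker.
    std::sregex_iterator i(text.begin(), text.end(), e), j;
    for (; i != j; ++i)
        out.push_back(i->str(0)); // $0 of each successive match
    return out;
}
```

<P>With a u32regex the loop has the same shape, except that the begin iterator
would come from one of the make_u32regex_iterator overloads listed above.</P>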
<P><STRONG>Example</STRONG>: search for international currency symbols, along with
their associated numeric value:</P>
<PRE>
void enumerate_currencies(const std::string&amp; text)
{
// enumerate and print all the currency symbols, along
// with any associated numeric values:
const char* re =
"([[:Sc:]][[:Cf:][:Cc:][:Z*:]]*)?"
"([[:Nd:]]+(?:[[:Po:]][[:Nd:]]+)?)?"
"(?(1)"
"|(?(2)"
"[[:Cf:][:Cc:][:Z*:]]*"
")"
"[[:Sc:]]"
")";
boost::u32regex r = boost::make_u32regex(re);
boost::u32regex_iterator&lt;std::string::const_iterator&gt; i(boost::make_u32regex_iterator(text, r)), j;
while(i != j)
{
std::cout &lt;&lt; (*i)[0] &lt;&lt; std::endl;
++i;
}
}</PRE>
<P>
<P>Calling
</P>
<PRE>enumerate_currencies(" $100.23 or &pound;198.12 ");</PRE>
<P>Yields the output:</P>
<PRE>$100.23<BR>&pound;198.12</PRE>
<P>Provided of course that the input is encoded as UTF-8.</P>
<H3><A name="u32regex_token_iterator"></A>u32regex_token_iterator</H3>
<P>Type u32regex_token_iterator is in all respects the same as <A href="regex_token_iterator.html">
regex_token_iterator</A> except that since the regular expression type is
always u32regex it only takes one template parameter (the iterator type).&nbsp;
It also calls u32regex_search internally, allowing it to interface correctly
with UTF-8, UTF-16, and UTF-32 data:</P>
<PRE>template &lt;class BidirectionalIterator&gt;
class u32regex_token_iterator
{
// for members see <A href="regex_token_iterator.html">regex_token_iterator</A>
};
typedef u32regex_token_iterator&lt;const char*&gt; utf8regex_token_iterator;
typedef u32regex_token_iterator&lt;const UChar*&gt; utf16regex_token_iterator;
typedef u32regex_token_iterator&lt;const UChar32*&gt; utf32regex_token_iterator;
</PRE>
<P>In order to simplify the construction of a u32regex_token_iterator from a
string, there are a series of non-member helper functions called
make_u32regex_token_iterator:</P>
<PRE>
u32regex_token_iterator&lt;const char*&gt;
make_u32regex_token_iterator(const char* s,
const u32regex&amp; e,
int sub,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_token_iterator&lt;const wchar_t*&gt;
make_u32regex_token_iterator(const wchar_t* s,
const u32regex&amp; e,
int sub,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_token_iterator&lt;const UChar*&gt;
make_u32regex_token_iterator(const UChar* s,
const u32regex&amp; e,
int sub,
regex_constants::match_flag_type m = regex_constants::match_default);
template &lt;class charT, class Traits, class Alloc&gt;
u32regex_token_iterator&lt;typename std::basic_string&lt;charT, Traits, Alloc&gt;::const_iterator&gt;
make_u32regex_token_iterator(const std::basic_string&lt;charT, Traits, Alloc&gt;&amp; s,
const u32regex&amp; e,
int sub,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_token_iterator&lt;const UChar*&gt;
make_u32regex_token_iterator(const UnicodeString&amp; s,
const u32regex&amp; e,
int sub,
regex_constants::match_flag_type m = regex_constants::match_default);</PRE>
<P>
<P>Each of these overloads returns an iterator that enumerates all occurrences of
marked sub-expression <EM>sub</EM> in regular expression&nbsp;<EM>e</EM>, found
in text <EM>s</EM>, using match_flags <EM>m.</EM></P>
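<P>To show what enumerating a single marked sub-expression looks like, here is a
sketch that uses the standard library's <code>std::sregex_token_iterator</code> as a
stand-in, so that it compiles without ICU. It is an analogue restricted to plain <code>char</code>
data, not the u32regex_token_iterator API, and <code>all_submatches</code> is a
hypothetical helper name.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Standard-library stand-in for the token iteration pattern; NOT the
// u32regex_token_iterator API, and limited to plain char data.
// all_submatches is a hypothetical helper name.
inline std::vector<std::string> all_submatches(const std::string& text,
                                               const std::regex& e,
                                               int sub)
{
    std::vector<std::string> out;
    // Visit marked sub-expression `sub` of every match of e in text.
    std::sregex_token_iterator i(text.begin(), text.end(), e, sub), j;
    for (; i != j; ++i)
        out.push_back(i->str());
    return out;
}
```

<P>Given the expression <code>([A-Za-z]+)=([0-9]+)</code> and the text <code>a=1, bc=22</code>,
asking for sub-expression 1 yields <code>a</code> and <code>bc</code>, while asking for
sub-expression 2 yields <code>1</code> and <code>22</code>.</P>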
<PRE>
template &lt;std::size_t N&gt;
u32regex_token_iterator&lt;const char*&gt;
make_u32regex_token_iterator(const char* p,
const u32regex&amp; e,
const int (&amp;submatch)[N],
regex_constants::match_flag_type m = regex_constants::match_default);
template &lt;std::size_t N&gt;
u32regex_token_iterator&lt;const wchar_t*&gt;
make_u32regex_token_iterator(const wchar_t* p,
const u32regex&amp; e,
const int (&amp;submatch)[N],
regex_constants::match_flag_type m = regex_constants::match_default);
template &lt;std::size_t N&gt;
u32regex_token_iterator&lt;const UChar*&gt;
make_u32regex_token_iterator(const UChar* p,
const u32regex&amp; e,
const int (&amp;submatch)[N],
regex_constants::match_flag_type m = regex_constants::match_default);
template &lt;class charT, class Traits, class Alloc, std::size_t N&gt;
u32regex_token_iterator&lt;typename std::basic_string&lt;charT, Traits, Alloc&gt;::const_iterator&gt;
make_u32regex_token_iterator(const std::basic_string&lt;charT, Traits, Alloc&gt;&amp; p,
const u32regex&amp; e,
const int (&amp;submatch)[N],
regex_constants::match_flag_type m = regex_constants::match_default);
template &lt;std::size_t N&gt;
u32regex_token_iterator&lt;const UChar*&gt;
make_u32regex_token_iterator(const UnicodeString&amp; s,
const u32regex&amp; e,
const int (&amp;submatch)[N],
regex_constants::match_flag_type m = regex_constants::match_default);
</PRE>
<P>Each of these overloads returns an iterator that enumerates, for every match of
regular expression&nbsp;<EM>e</EM> found in the text (<EM>p</EM> or <EM>s</EM>), one
sub-expression for each index in the array <EM>submatch</EM>, using match_flags <EM>m.</EM></P>
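<P>When several sub-expression indexes are supplied, the iterator steps through
them in order for each match before moving on to the next match. The sketch below
demonstrates that ordering with the standard library's <code>std::sregex_token_iterator</code>,
used here purely as an ICU-free analogue on plain <code>char</code> data; <code>key_value_tokens</code>
is a name invented for this example.</P>

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Standard-library analogue, NOT the u32regex_token_iterator API.
// key_value_tokens is a hypothetical helper name.
inline std::vector<std::string> key_value_tokens(const std::string& text)
{
    static const std::regex e("([A-Za-z]+)=([0-9]+)");
    const int submatch[] = {1, 2}; // enumerate $1, then $2, for every match
    std::vector<std::string> out;
    std::sregex_token_iterator i(text.begin(), text.end(), e, submatch), j;
    for (; i != j; ++i)
        out.push_back(i->str());
    return out;
}
```

<P>For the text <code>a=1, bc=22</code> the tokens arrive as <code>a</code>, <code>1</code>, <code>bc</code>,
<code>22</code>: both sub-expressions of the first match, followed by both of the second.</P>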
<PRE>
u32regex_token_iterator&lt;const char*&gt;
make_u32regex_token_iterator(const char* p,
const u32regex&amp; e,
const std::vector&lt;int&gt;&amp; submatch,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_token_iterator&lt;const wchar_t*&gt;
make_u32regex_token_iterator(const wchar_t* p,
const u32regex&amp; e,
const std::vector&lt;int&gt;&amp; submatch,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_token_iterator&lt;const UChar*&gt;
make_u32regex_token_iterator(const UChar* p,
const u32regex&amp; e,
const std::vector&lt;int&gt;&amp; submatch,
regex_constants::match_flag_type m = regex_constants::match_default);
template &lt;class charT, class Traits, class Alloc&gt;
u32regex_token_iterator&lt;typename std::basic_string&lt;charT, Traits, Alloc&gt;::const_iterator&gt;
make_u32regex_token_iterator(const std::basic_string&lt;charT, Traits, Alloc&gt;&amp; p,
const u32regex&amp; e,
const std::vector&lt;int&gt;&amp; submatch,
regex_constants::match_flag_type m = regex_constants::match_default);
u32regex_token_iterator&lt;const UChar*&gt;
make_u32regex_token_iterator(const UnicodeString&amp; s,
const u32regex&amp; e,
const std::vector&lt;int&gt;&amp; submatch,
regex_constants::match_flag_type m = regex_constants::match_default);
</PRE>
<P>Each of these overloads returns an iterator that enumerates, for every match of
regular expression&nbsp;<EM>e</EM> found in the text (<EM>p</EM> or <EM>s</EM>), one
sub-expression for each index in the vector <EM>submatch</EM>, using match_flags <EM>m.</EM></P>
<P><STRONG>Example</STRONG>: search for international currency symbols, along with
their associated numeric value:</P>
<PRE>
void enumerate_currencies2(const std::string&amp; text)
{
// enumerate and print all the currency symbols, along
// with any associated numeric values:
const char* re =
"([[:Sc:]][[:Cf:][:Cc:][:Z*:]]*)?"
"([[:Nd:]]+(?:[[:Po:]][[:Nd:]]+)?)?"
"(?(1)"
"|(?(2)"
"[[:Cf:][:Cc:][:Z*:]]*"
")"
"[[:Sc:]]"
")";
boost::u32regex r = boost::make_u32regex(re);
boost::u32regex_token_iterator&lt;std::string::const_iterator&gt;
i(boost::make_u32regex_token_iterator(text, r, 1)), j;
while(i != j)
{
std::cout &lt;&lt; *i &lt;&lt; std::endl;
++i;
}
}
</PRE>
<P>
<HR>
<p>Revised&nbsp;
<!--webbot bot="Timestamp" S-Type="EDITED" S-Format="%d %B, %Y" startspan -->
05 Jan 2005&nbsp;
<!--webbot bot="Timestamp" endspan i-checksum="39359" --></p>
<p><i>&copy; Copyright John Maddock&nbsp;2005</i></p>
<P><I>Use, modification and distribution are subject to the Boost Software License,
Version 1.0. (See accompanying file <A href="../../../LICENSE_1_0.txt">LICENSE_1_0.txt</A>
or copy at <A href="http://www.boost.org/LICENSE_1_0.txt">http://www.boost.org/LICENSE_1_0.txt</A>)</I></P>
</body>
</html>