< html >
< head >
< meta http-equiv = "Content-Type" content = "text/html; charset=UTF-8" >
< title > Unicode and Boost.Regex< / title >
< link rel = "stylesheet" href = "../../../../../doc/src/boostbook.css" type = "text/css" >
< meta name = "generator" content = "DocBook XSL Stylesheets V1.79.1" >
< link rel = "home" href = "../index.html" title = "Boost.Regex 7.0.1" >
< link rel = "up" href = "../index.html" title = "Boost.Regex 7.0.1" >
< link rel = "prev" href = "intro.html" title = "Introduction and Overview" >
< link rel = "next" href = "captures.html" title = "Understanding Marked Sub-Expressions and Captures" >
< meta name = "viewport" content = "width=device-width, initial-scale=1" >
< / head >
< body bgcolor = "white" text = "black" link = "#0000FF" vlink = "#840084" alink = "#0000FF" >
< table cellpadding = "2" width = "100%" > < tr >
< td valign = "top" > < img alt = "Boost C++ Libraries" width = "277" height = "86" src = "../../../../../boost.png" > < / td >
< td align = "center" > < a href = "../../../../../index.html" > Home< / a > < / td >
< td align = "center" > < a href = "../../../../../libs/libraries.htm" > Libraries< / a > < / td >
< td align = "center" > < a href = "http://www.boost.org/users/people.html" > People< / a > < / td >
< td align = "center" > < a href = "http://www.boost.org/users/faq.html" > FAQ< / a > < / td >
< td align = "center" > < a href = "../../../../../more/index.htm" > More< / a > < / td >
< / tr > < / table >
< hr >
< div class = "spirit-nav" >
< a accesskey = "p" href = "intro.html" > < img src = "../../../../../doc/src/images/prev.png" alt = "Prev" > < / a > < a accesskey = "u" href = "../index.html" > < img src = "../../../../../doc/src/images/up.png" alt = "Up" > < / a > < a accesskey = "h" href = "../index.html" > < img src = "../../../../../doc/src/images/home.png" alt = "Home" > < / a > < a accesskey = "n" href = "captures.html" > < img src = "../../../../../doc/src/images/next.png" alt = "Next" > < / a >
< / div >
< div class = "section" >
< div class = "titlepage" > < div > < div > < h2 class = "title" style = "clear: both" >
< a name = "boost_regex.unicode" > < / a > < a class = "link" href = "unicode.html" title = "Unicode and Boost.Regex" > Unicode and Boost.Regex< / a >
< / h2 > < / div > < / div > < / div >
< p >
There are two ways to use Boost.Regex with Unicode strings:
< / p >
< h5 >
< a name = "boost_regex.unicode.h0" > < / a >
< span class = "phrase" > < a name = "boost_regex.unicode.rely_on_wchar_t" > < / a > < / span > < a class = "link" href = "unicode.html#boost_regex.unicode.rely_on_wchar_t" > Rely
on wchar_t< / a >
< / h5 >
< p >
If your platform's < code class = "computeroutput" > < span class = "keyword" > wchar_t< / span > < / code > type
can hold Unicode strings, and your platform's C/C++ runtime correctly handles
 wide character constants (when passed to < code class = "computeroutput" > < span class = "identifier" > std< / span > < span class = "special" > ::< / span > < span class = "identifier" > iswspace< / span > < / code > ,
 < code class = "computeroutput" > < span class = "identifier" > std< / span > < span class = "special" > ::< / span > < span class = "identifier" > iswlower< / span > < / code > , etc.), then you can use < code class = "computeroutput" > < span class = "identifier" > boost< / span > < span class = "special" > ::< / span > < span class = "identifier" > wregex< / span > < / code >
to process Unicode. However, there are several disadvantages to this approach:
< / p >
< div class = "itemizedlist" > < ul class = "itemizedlist" style = "list-style-type: disc; " >
< li class = "listitem" >
 It's not portable: there's no guarantee on the width of < code class = "computeroutput" > < span class = "keyword" > wchar_t< / span > < / code > ,
 or even that the runtime treats wide characters as Unicode at all. Most
 Windows compilers do so, but many Unix systems do not.
< / li >
< li class = "listitem" >
There's no support for Unicode-specific character classes: < code class = "computeroutput" > < span class = "special" > [[:< / span > < span class = "identifier" > Nd< / span > < span class = "special" > :]]< / span > < / code > , < code class = "computeroutput" > < span class = "special" > [[:< / span > < span class = "identifier" > Po< / span > < span class = "special" > :]]< / span > < / code >
etc.
< / li >
< li class = "listitem" >
 You can only search strings that are encoded as sequences of wide characters;
 it is not possible to search UTF-8, or even UTF-16 on many platforms.
< / li >
< / ul > < / div >
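 < p >
 The portability caveats above can be probed before committing to the wide
 character approach. The following is a minimal sketch that uses only the C++
 standard library; the helper names wchar_holds_any_code_point and
 count_wide_spaces are hypothetical and are not part of Boost.Regex.
 < / p >

```cpp
#include <cstddef>
#include <cwctype>
#include <limits>
#include <string>

// Hypothetical helper: true when a single wchar_t can represent every
// Unicode code point (the highest is U+10FFFF). On platforms with a
// 16-bit wchar_t this is false, and text outside the Basic Multilingual
// Plane has to be stored as surrogate pairs.
bool wchar_holds_any_code_point()
{
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}

// Hypothetical helper: counts the characters that the C runtime classifies
// as whitespace via std::iswspace. The result depends on the runtime and
// the current locale, which is the dependency described above.
std::size_t count_wide_spaces(const std::wstring& text)
{
    std::size_t count = 0;
    for (wchar_t c : text)
    {
        if (std::iswspace(static_cast<std::wint_t>(c)))
            ++count;
    }
    return count;
}
```

 < p >
 A check such as wchar_holds_any_code_point() only covers the width of the
 type; whether the runtime classifies non-ASCII wide characters correctly
 still has to be verified on each target platform.
 < / p >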
< h5 >
< a name = "boost_regex.unicode.h1" > < / a >
< span class = "phrase" > < a name = "boost_regex.unicode.use_a_unicode_aware_regular_expr" > < / a > < / span > < a class = "link" href = "unicode.html#boost_regex.unicode.use_a_unicode_aware_regular_expr" > Use
 a Unicode-Aware Regular Expression Type< / a >
< / h5 >
< p >
If you have the < a href = "http://www.ibm.com/software/globalization/icu/" target = "_top" > ICU
 library< / a > , then Boost.Regex provides a distinct regular expression type
 (boost::u32regex) that supports both Unicode-specific character properties
 and the searching of text encoded in UTF-8, UTF-16, or UTF-32.
See: < a class = "link" href = "ref/non_std_strings/icu.html" title = "Working With Unicode and ICU String Types" > ICU string class support< / a > .
< / p >
< / div >
< div class = "copyright-footer" > Copyright © 1998-2013 John Maddock< p >
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at < a href = "http://www.boost.org/LICENSE_1_0.txt" target = "_top" > http://www.boost.org/LICENSE_1_0.txt< / a > )
< / p >
< / div >
< hr >
< div class = "spirit-nav" >
< a accesskey = "p" href = "intro.html" > < img src = "../../../../../doc/src/images/prev.png" alt = "Prev" > < / a > < a accesskey = "u" href = "../index.html" > < img src = "../../../../../doc/src/images/up.png" alt = "Up" > < / a > < a accesskey = "h" href = "../index.html" > < img src = "../../../../../doc/src/images/home.png" alt = "Home" > < / a > < a accesskey = "n" href = "captures.html" > < img src = "../../../../../doc/src/images/next.png" alt = "Next" > < / a >
< / div >
< / body >
< / html >