Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I am not able to understand the differences between std::string and std::wstring. I know wstring supports wide characters such as Unicode characters. I have the following questions:

  1. When should I use std::wstring over std::string?
  2. Can std::string hold the entire ASCII character set, including the special characters?
  3. Is std::wstring supported by all popular C++ compilers?
  4. What exactly is a "wide character"?
  ask by translate from so


1 Answer

string? wstring?

std::string is a basic_string templated on char, and std::wstring on wchar_t.
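The relationship can be checked at compile time. The following is a minimal sketch; `code_unit_size` is a hypothetical helper introduced only for illustration, not part of the standard library.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are aliases for basic_string specializations.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// Hypothetical helper: size in bytes of one element of a basic_string type.
template <typename String>
constexpr std::size_t code_unit_size() {
    return sizeof(typename String::value_type);
}
```

`code_unit_size<std::string>()` is always 1, while `code_unit_size<std::wstring>()` depends on the platform's wchar_t.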

char vs. wchar_t

char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.

What about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to Unicode.

On Linux?

Let's take a Linux OS: my Ubuntu system is already Unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:

#include <cstddef>   // size_t
#include <cstring>   // strlen
#include <cwchar>    // wcslen
#include <iostream>

int main(int argc, char* argv[])
{
   const char text[] = "olé" ;


   std::cout << "sizeof(char)    : " << sizeof(char) << std::endl ;
   std::cout << "text            : " << text << std::endl ;
   std::cout << "sizeof(text)    : " << sizeof(text) << std::endl ;
   std::cout << "strlen(text)    : " << strlen(text) << std::endl ;

   std::cout << "text(ordinals)  :" ;

   for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
   {
      std::cout << " " << static_cast<unsigned int>(
                              static_cast<unsigned char>(text[i])
                          );
   }

   std::cout << std::endl << std::endl ;

   // - - - 

   const wchar_t wtext[] = L"olé" ;

   std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
   //std::cout << "wtext           : " << wtext << std::endl ; <- error
   std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << std::endl ;
   std::wcout << L"wtext           : " << wtext << std::endl;

   std::cout << "sizeof(wtext)   : " << sizeof(wtext) << std::endl ;
   std::cout << "wcslen(wtext)   : " << wcslen(wtext) << std::endl ;

   std::cout << "wtext(ordinals) :" ;

   for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
   {
      // Cast to a type wide enough for a 4-byte wchar_t (e.g. on Linux).
      std::cout << " " << static_cast<unsigned long>(wtext[i]);
   }

   std::cout << std::endl << std::endl ;

   return 0;
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(ordinals)  : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
wtext           : ol?
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(ordinals) : 111 108 233

You'll see that the "olé" text in char is really made of four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

So, when working with char on Linux, you will usually end up using Unicode without even knowing it. And since std::string works with char, std::string is already Unicode-ready.

Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not three. So you should be cautious when truncating/playing with Unicode chars, because some combinations of chars are forbidden in UTF-8.
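To make the caution about truncation concrete, here is a minimal sketch that counts code points and truncates without splitting a multi-byte sequence. The function names are hypothetical, and the code assumes its input is already valid UTF-8; it counts code points, which is still not the same as user-perceived characters when combining marks are involved.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_utf8_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Counts code points by counting every byte that is not a continuation byte.
// Assumes the input is valid UTF-8.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if (!is_utf8_continuation(byte)) ++count;
    }
    return count;
}

// Truncates to at most max_bytes without cutting a multi-byte sequence.
inline std::string utf8_truncate(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // Back up while the first dropped byte is a continuation byte,
    // i.e. while the cut would land in the middle of a sequence.
    while (end > 0 && is_utf8_continuation(static_cast<unsigned char>(s[end]))) {
        --end;
    }
    return s.substr(0, end);
}
```

With "olé" written as the bytes `"ol\xC3\xA9"`, `size()` reports 4, `utf8_code_points` reports 3, and `utf8_truncate(..., 3)` returns "ol" rather than "ol" followed by a dangling 0xC3 byte.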

On Windows?

On Windows, this is a bit different. Before the advent of Unicode, Win32 had to support a lot of applications working with char and with the different charsets/code pages produced all over the world.

So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/code page of the machine. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
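The "olé" versus "olй" effect comes from the same byte, 0xE9, being interpreted under two code pages. The sketch below decodes a single byte by hand; both functions are hypothetical and deliberately partial (they cover ASCII and the upper letter range only, and return U+FFFD for everything else), so they are an illustration rather than a full code page table.

```cpp
#include <cstdint>

// Partial Windows-1252 decoder: ASCII and 0xA0-0xFF map to the Unicode
// code point with the same value. 0x80-0x9F is not handled in this sketch.
inline std::uint32_t decode_cp1252(unsigned char byte) {
    if (byte < 0x80 || byte >= 0xA0) return byte;
    return 0xFFFD;  // replacement character for the unhandled range
}

// Partial Windows-1251 decoder: ASCII maps directly, and 0xC0-0xFF map to
// the Cyrillic letters U+0410..U+044F. 0x80-0xBF is not handled here.
inline std::uint32_t decode_cp1251(unsigned char byte) {
    if (byte < 0x80) return byte;
    if (byte >= 0xC0) return 0x0410 + (byte - 0xC0);
    return 0xFFFD;  // replacement character for the unhandled range
}
```

Here `decode_cp1252(0xE9)` gives U+00E9 (é) while `decode_cp1251(0xE9)` gives U+0439 (й), which is why the same char string displays differently depending on the machine's code page.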

For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and encoded in UTF-16, a Unicode encoding built from 2-byte code units (or at the very least, the mostly compatible UCS-2, which is almost the same thing IIRC).
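The difference between UTF-16 and UCS-2 shows up for code points above U+FFFF, which UTF-16 encodes as two 2-byte units (a surrogate pair). A minimal sketch of that encoding step follows; `to_utf16` is a hypothetical helper, it uses the portable char16_t rather than wchar_t, and it assumes the input is a valid code point.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes code_point is valid (<= 0x10FFFF and not itself a surrogate).
inline std::u16string to_utf16(char32_t code_point) {
    std::u16string out;
    if (code_point < 0x10000) {
        // Fits in a single 2-byte unit.
        out.push_back(static_cast<char16_t>(code_point));
    } else {
        // Split the 20-bit offset into a high and a low surrogate.
        const char32_t offset = code_point - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return out;
}
```

For é (U+00E9) this yields one unit, while U+1F600 yields the pair 0xD83D 0xDE00.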

Applications using char are called "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are called "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText() (a low-level API function to set the label on a Win32 GUI).

Memory issues?

UTF-32 uses 4 bytes per character, so there is not much to add, except that UTF-8 and UTF-16 text will always use less than or the same amount of memory as UTF-32 text (and usually less).

If there is a memory issue, then you should know that for most Western languages, UTF-8 text will use less memory than the same text in UTF-16.
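As a rough illustration of the memory comparison, the sketch below measures "olé" in the three encodings. The helper names are hypothetical, é is written with escapes so the result does not depend on the source file's encoding, and the figures in the note assume the usual 2-byte char16_t and 4-byte char32_t.

```cpp
#include <cstddef>
#include <string>

// Bytes used by the code units of a string (terminator excluded).
template <typename CharT>
std::size_t payload_bytes(const std::basic_string<CharT>& s) {
    return s.size() * sizeof(CharT);
}

// "olé" stored as UTF-8 (raw bytes), UTF-16 and UTF-32.
inline std::size_t utf8_bytes() {
    return payload_bytes(std::string("ol\xC3\xA9"));
}
inline std::size_t utf16_bytes() {
    return payload_bytes(std::u16string(u"ol\u00e9"));
}
inline std::size_t utf32_bytes() {
    return payload_bytes(std::u32string(U"ol\u00e9"));
}
```

On mainstream platforms these return 4, 6 and 12 bytes respectively, which matches the ordering described above for Western text.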

Still, for o

