string? wstring?
std::string is a basic_string templated on a char, and std::wstring on a wchar_t.
char vs. wchar_t
char is supposed to hold a character, usually an 8-bit character. wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, it's 2 bytes.
The problem is that neither char nor wchar_t is directly tied to Unicode.
On Linux?
Let's take a Linux OS: my Ubuntu system is already Unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string of chars). The following code:
#include <cstring>
#include <cwchar>    // for wcslen
#include <iostream>

int main(int argc, char* argv[])
{
   const char text[] = "olé" ;

   std::cout << "sizeof(char) : " << sizeof(char) << std::endl ;
   std::cout << "text : " << text << std::endl ;
   std::cout << "sizeof(text) : " << sizeof(text) << std::endl ;
   std::cout << "strlen(text) : " << strlen(text) << std::endl ;

   std::cout << "text(ordinals) :" ;

   for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
   {
      // go through unsigned char so bytes >= 0x80 print as 128..255
      std::cout << " " << static_cast<unsigned int>(
                              static_cast<unsigned char>(text[i])
                          ) ;
   }

   std::cout << std::endl << std::endl ;

   // - - -

   const wchar_t wtext[] = L"olé" ;

   std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl ;
   //std::cout << "wtext : " << wtext << std::endl ; <- error
   std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << std::endl ;
   std::wcout << L"wtext : " << wtext << std::endl ; // prints "ol?" unless a suitable locale is set

   std::cout << "sizeof(wtext) : " << sizeof(wtext) << std::endl ;
   std::cout << "wcslen(wtext) : " << wcslen(wtext) << std::endl ;

   std::cout << "wtext(ordinals) :" ;

   for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
   {
      // fine for "olé"; would truncate code points above 0xFFFF
      std::cout << " " << static_cast<unsigned int>(
                              static_cast<unsigned short>(wtext[i])
                          ) ;
   }

   std::cout << std::endl << std::endl ;

   return 0;
}
outputs the following text:
sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169
sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol?
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233
You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)
So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, std::string is already Unicode-ready.
Note that std::string, like the C string API, will consider the "olé" string to have four characters, not three. So you should be cautious when truncating or otherwise manipulating Unicode chars, because some combinations of bytes are forbidden in UTF-8: a naive cut can split a multi-byte sequence and leave the string invalid.
On Windows?
On Windows, this is a bit different. Before the advent of Unicode, Win32 had to support a lot of applications working with char, on the different charsets/codepages produced all around the world.
So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage of the machine. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide and is encoded in UTF-16, that is, Unicode encoded in 2-byte units (or at the very least, the mostly compatible UCS-2, which is almost the same thing IIRC).
Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_ts). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info, and the sketch below.
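For illustration, here is a minimal sketch of the usual two-call idiom with MultiByteToWideChar, assuming a Win32 environment (Utf8ToWide is a hypothetical helper name, and error handling is kept to a bare minimum):

#include <windows.h>
#include <string>

std::wstring Utf8ToWide(const std::string & utf8)
{
   // first call, with a null buffer, asks for the required size
   // in wchar_t units, terminating zero included
   const int size = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0) ;
   if(size <= 1) return std::wstring() ; // empty input or conversion failure
   std::wstring wide(size - 1, L'\0') ;
   // second call performs the actual conversion into our buffer
   MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], size) ;
   return wide ;
}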
Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or QT...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using an API like SetWindowText() (a low-level API function to set the label on a Win32 GUI).
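For illustration, a minimal sketch of that A/W split (assuming <windows.h> and valid window handles): SetWindowText is in fact a macro that expands to SetWindowTextA or SetWindowTextW depending on whether UNICODE is defined.

#include <windows.h>

void SetLabels(HWND hwndA, HWND hwndW)
{
   // "A" variant: takes a char string, interpreted in the local
   // codepage, and converted to wchar_t internally by Windows
   SetWindowTextA(hwndA, "olé") ;
   // "W" variant: takes a wchar_t (UTF-16) string, no conversion needed
   SetWindowTextW(hwndW, L"olé") ;
}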
Memory issues?
UTF-32 is 4 bytes per character, so there is not much to add there, if only that a UTF-8 text and a UTF-16 text will always use less than or the same amount of memory as a UTF-32 text (and usually less).
If there is a memory issue, then you should know that for most western languages, UTF-8 text will use less memory than the same UTF-16 one.
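For example, a minimal sketch (C++11, with a UTF-8 source encoding as in the Linux example above) comparing the storage used by the same three-character text:

#include <iostream>

int main()
{
   const char     utf8[]  = "olé" ;  // 'o' + 'l' + 2 bytes for 'é', + '\0'
   const char16_t utf16[] = u"olé" ; // 3 code units + terminator, 2 bytes each
   const char32_t utf32[] = U"olé" ; // 3 code units + terminator, 4 bytes each

   std::cout << "UTF-8  : " << sizeof(utf8)  << " bytes" << std::endl ; // 5
   std::cout << "UTF-16 : " << sizeof(utf16) << " bytes" << std::endl ; // 8
   std::cout << "UTF-32 : " << sizeof(utf32) << " bytes" << std::endl ; // 16
   return 0;
}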
Still, for o