The exact nature of wide characters is (purposefully) left implementation defined.
When they first invented the concept of wchar_t
, ISO 10646 and Unicode were still competing with each other (whereas they now, mostly cooperate). Rather than try to decree that an international character would be one or the other (or possibly something else entirely) they simply provided a type (and some functions) that the implementation could define to support international character sets as they chose.
Different implementations have exercised that potential for variation. For example, if you use Microsoft's compiler on Windows, wchar_t
will be a 16-bit type holding UTF-16 Unicode (originally it held UCS-2 Unicode, but that's now officially obsolete).
On Linux, wchar_t
will more often be a 32-bit type, holding UCS-4/UTF-32 encoded Unicode. Ports of gcc to at least some other operating systems do the same, though I've never tried to confirm that it's always the case.
There is, however, no guarantee of that. At least in theory an implementation on Linux could use 16 bits, or one on Windows could use 32 bits, or either one could decide to use 64 bits (though I'd be a little surprised to see that in reality).
In any case, the general idea of how things are intended to work, is that a single wchar_t
is sufficient to represent a code point. For I/O, the data is intended to be converted from the external representation (whatever it is) into wchar_t
s, which (is supposed to) make them relatively easy to manipulate. Then during output, they again get transformed into the encoding of your choice (which may be entirely different from the encoding you read).
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…