I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) the following:
- Given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string.
- Assume the UTF-8 string has already been validated: it has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8.
- Full UTF-16 with surrogates must be supported.
Specifically, I wonder if there are shortcuts for knowing when a surrogate pair will be needed without fully decoding the UTF-8 sequence into a codepoint. Something like the sketch below is what I have in mind.
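For reference, here is a minimal scalar sketch of the kind of shortcut I mean (the function name `utf16_bytes_needed` is just a placeholder). It relies on two facts about valid UTF-8: every byte that is not a continuation byte (`10xxxxxx`) starts a codepoint, and only codepoints encoded with a 4-byte sequence (lead byte `0xF0`..`0xF4`) need a surrogate pair in UTF-16, so the lead byte alone tells you whether an extra code unit is required:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes the UTF-16 encoding of a valid,
 * null-terminated UTF-8 string would occupy (terminator not counted).
 * Assumes the input is already validated, as stated above. */
size_t utf16_bytes_needed(const uint8_t *s)
{
    size_t units = 0;              /* UTF-16 code units */
    for (; *s; s++) {
        if ((*s & 0xC0) != 0x80)   /* lead byte or ASCII: one code unit */
            units++;
        if (*s >= 0xF0)            /* 4-byte sequence: surrogate pair, one extra unit */
            units++;
    }
    return units * 2;              /* two bytes per UTF-16 code unit */
}
```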
The best UTF-8-to-codepoint code I've seen uses vectorizing techniques, so I wonder if that's also possible here.
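Since both counts above reduce to per-byte classification, vectorizing seems plausible. The following is only a rough SSE2 sketch, assuming the byte length is known up front (a null-terminated scan complicates wide loads) and using GCC/Clang's `__builtin_popcount`; the function name `utf16_bytes_needed_sse2` is hypothetical:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

size_t utf16_bytes_needed_sse2(const uint8_t *s, size_t len)
{
    size_t units = 0;
    size_t i = 0;

    const __m128i cont_thresh = _mm_set1_epi8((char)0xBF); /* -65: signed b > -65 means "not a continuation byte" */
    const __m128i f0_thresh   = _mm_set1_epi8((char)0x6F); /* after flipping the sign bit, bytes > 0x6F were >= 0xF0 */
    const __m128i flip        = _mm_set1_epi8((char)0x80);

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));

        /* Bytes that start a codepoint: signed compare b > 0xBF. */
        int lead_mask = _mm_movemask_epi8(_mm_cmpgt_epi8(v, cont_thresh));

        /* Lead bytes of 4-byte sequences: unsigned b >= 0xF0, done as a
         * signed compare after XORing the sign bit. */
        int supp_mask = _mm_movemask_epi8(
            _mm_cmpgt_epi8(_mm_xor_si128(v, flip), f0_thresh));

        units += (unsigned)__builtin_popcount((unsigned)lead_mask);
        units += (unsigned)__builtin_popcount((unsigned)supp_mask);
    }

    /* Scalar tail for the last few bytes. */
    for (; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) units++;
        if (s[i] >= 0xF0)          units++;
    }
    return units * 2;
}
```

Whether something smarter than "classify every byte and popcount" exists is exactly what I'm asking.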