I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) the following:
- Given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string.
- Assume the UTF-8 string has already been validated: it has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8.
- Full UTF-16 with surrogates must be supported.
Specifically, I wonder if there are shortcuts for knowing when a surrogate pair will be needed without fully decoding the UTF-8 sequence into a codepoint. Something like the sketch below is what I have in mind.
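For reference, here is a minimal scalar sketch of the kind of shortcut I mean (the function name `utf16_bytes_needed` is just a placeholder). It relies on two facts about valid UTF-8: every byte that is not a continuation byte (`10xxxxxx`) starts a codepoint, and only codepoints encoded with a 4-byte sequence (lead byte `0xF0`..`0xF4`) need a surrogate pair in UTF-16, so the lead byte alone tells you whether an extra code unit is required:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes the UTF-16 encoding of a valid,
 * null-terminated UTF-8 string would occupy (terminator not counted).
 * Assumes the input is already validated, as stated above. */
size_t utf16_bytes_needed(const uint8_t *s)
{
    size_t units = 0;              /* UTF-16 code units */
    for (; *s; s++) {
        if ((*s & 0xC0) != 0x80)   /* lead byte or ASCII: one code unit */
            units++;
        if (*s >= 0xF0)            /* 4-byte sequence: surrogate pair, one extra unit */
            units++;
    }
    return units * 2;              /* two bytes per UTF-16 code unit */
}
```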
The best UTF-8-to-codepoint code I've seen uses vectorizing techniques, so I wonder if that's also possible here.
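Since both counts above reduce to per-byte classification, vectorizing seems plausible. The following is only a rough SSE2 sketch, assuming the byte length is known up front (a null-terminated scan complicates wide loads) and using GCC/Clang's `__builtin_popcount`; the function name `utf16_bytes_needed_sse2` is hypothetical:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

size_t utf16_bytes_needed_sse2(const uint8_t *s, size_t len)
{
    size_t units = 0;
    size_t i = 0;

    const __m128i cont_thresh = _mm_set1_epi8((char)0xBF); /* -65: signed b > -65 means "not a continuation byte" */
    const __m128i f0_thresh   = _mm_set1_epi8((char)0x6F); /* after flipping the sign bit, bytes > 0x6F were >= 0xF0 */
    const __m128i flip        = _mm_set1_epi8((char)0x80);

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));

        /* Bytes that start a codepoint: signed compare b > 0xBF. */
        int lead_mask = _mm_movemask_epi8(_mm_cmpgt_epi8(v, cont_thresh));

        /* Lead bytes of 4-byte sequences: unsigned b >= 0xF0, done as a
         * signed compare after XORing the sign bit. */
        int supp_mask = _mm_movemask_epi8(
            _mm_cmpgt_epi8(_mm_xor_si128(v, flip), f0_thresh));

        units += (unsigned)__builtin_popcount((unsigned)lead_mask);
        units += (unsigned)__builtin_popcount((unsigned)supp_mask);
    }

    /* Scalar tail for the last few bytes. */
    for (; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) units++;
        if (s[i] >= 0xF0)          units++;
    }
    return units * 2;
}
```

Whether something smarter than "classify every byte and popcount" exists is exactly what I'm asking.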