Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Windows 10 Notepad no longer requires Unicode files to have a BOM header, and by default it no longer writes one. This breaks existing code that checks the header to detect Unicode files. How can I now tell in C++ whether a file is Unicode? Source: https://www.bleepingcomputer.com/news/microsoft/windows-10-notepad-is-getting-better-utf-8-encoding-support/

The code we currently use to detect Unicode:

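#include <windows.h> // for BYTE

// Returns 1 for a UTF-8 BOM, 2 for a UTF-16 (BE) BOM, 3 for a UTF-16 (LE) BOM, 0 if no BOM is found.
int IsUnicode(const BYTE p2bytes[3])
{
        if( p2bytes[0]==0xEF && p2bytes[1]==0xBB && p2bytes[2]==0xBF)
            return 1; // UTF-8
        if( p2bytes[0]==0xFE && p2bytes[1]==0xFF)
            return 2;  // UTF-16 (BE)
        if( p2bytes[0]==0xFF && p2bytes[1]==0xFE)
            return 3; // UTF-16 (LE)

        return 0; // no BOM
}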

If this is so painful, why isn't there a standard function to determine the encoding?

Question from: https://stackoverflow.com/questions/65933277/detecting-unicode-in-files-in-windows-10


1 Answer

You should use the W3C method, which goes something like this:

  • if you know the encoding, use that

  • if there is a BOM, use it to determine the encoding

  • otherwise, try to decode as UTF-8. UTF-8 has strict byte-sequence rules (that is the point of its design: you can always find the first byte of a character). So if the file is not UTF-8, decoding will very probably fail. In ANSI (cp-1252) text, an accented letter is rarely followed by a symbol, and it is very unlikely that every accented letter in a file is followed by one. In Latin-1 you would get C1 control characters instead of symbols, but it is just as rare for C1 control characters to appear only after accented letters, and after every one of them.

  • if decoding fails (you may test only the first 4096 bytes, or the first 10 bytes above 127), use the standard 8-bit encoding of the OS (probably cp-1252 on Windows).

This method should work very well. It is biased toward UTF-8, but the world moved in that direction long ago. Determining which code page a file uses is much more difficult.
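The steps above could be sketched in C++ roughly as follows. This is only a minimal sketch: the names Encoding, IsValidUtf8 and DetectEncoding are made up for illustration, the whole buffer is validated rather than a prefix, and the fallback is simply reported as "Ansi" without picking a code page. It also assumes that a file starting with FF FE 00 00 is UTF-32LE rather than UTF-16LE text beginning with U+0000.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type; not part of any Windows or standard API.
enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, Ansi };

// Strict UTF-8 check: rejects stray continuation bytes, truncated sequences,
// overlong forms, surrogates and code points above U+10FFFF.
bool IsValidUtf8(const unsigned char* p, std::size_t n)
{
    assert(p != nullptr || n == 0);
    static const std::uint32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = p[i];
        if (c < 0x80) { ++i; continue; }            // plain ASCII
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                          // invalid lead byte
        if (i + len > n) return false;              // sequence cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < kMin[len] || cp > 0x10FFFF) return false;   // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;      // surrogate
        i += len;
    }
    return true;
}

Encoding DetectEncoding(const unsigned char* p, std::size_t n)
{
    assert(p != nullptr || n == 0);
    // Step 2: BOM. UTF-32LE is tested before UTF-16LE because both start with FF FE.
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return Encoding::Utf8;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0 && p[3] == 0) return Encoding::Utf32LE;
    if (n >= 4 && p[0] == 0 && p[1] == 0 && p[2] == 0xFE && p[3] == 0xFF) return Encoding::Utf32BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Encoding::Utf16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Encoding::Utf16BE;
    // Step 3: no BOM, so try strict UTF-8 (pure ASCII also passes).
    if (IsValidUtf8(p, n)) return Encoding::Utf8;
    // Step 4: fall back to the OS 8-bit encoding (probably cp-1252 on Windows).
    return Encoding::Ansi;
}
```

If you pass only a prefix of the file (for example the first 4096 bytes), a multi-byte character cut off at the end of the prefix is reported as invalid by this sketch, so you may want to tolerate a truncated final sequence in that case.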

You may add a step before the last one. If the file contains many 00 bytes, it may be in a UTF-16 or UTF-32 form. Unicode requires that you know which form (e.g. from a side channel); otherwise the file should have a BOM. But you can guess the form (UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE) from the positions of the 00 bytes in the file: newlines and some ASCII characters belong to the Common script and are used with many scripts, so you should see many 00 bytes.
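One way to guess the form from the positions of the 00 bytes is to count zeros at each offset modulo 4. The sketch below is only a heuristic: WideGuess and GuessByZeroBytes are illustrative names, and the thresholds ("at least half" and "at most a tenth" of the 4-byte groups) are arbitrary assumptions that work for ASCII-heavy text and may need tuning for other content.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this heuristic.
enum class WideGuess { None, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Guesses a BOM-less UTF-16/UTF-32 form from where the 00 bytes fall.
// For ASCII-range characters:
//   UTF-16LE: xx 00 xx 00    UTF-16BE: 00 xx 00 xx
//   UTF-32LE: xx 00 00 00    UTF-32BE: 00 00 00 xx
WideGuess GuessByZeroBytes(const unsigned char* p, std::size_t n)
{
    assert(p != nullptr || n == 0);
    const std::size_t usable = n - (n % 4);   // ignore a trailing partial group
    const std::size_t groups = usable / 4;
    if (groups == 0) return WideGuess::None;

    std::size_t zeros[4] = {0, 0, 0, 0};      // zero count per offset mod 4
    for (std::size_t i = 0; i < usable; ++i)
        if (p[i] == 0) ++zeros[i % 4];

    // Assumed thresholds: "mostly zero" and "almost never zero".
    auto high = [&](int k) { return zeros[k] * 2 >= groups; };
    auto low  = [&](int k) { return zeros[k] * 10 <= groups; };

    if (high(1) && high(2) && high(3) && low(0)) return WideGuess::Utf32LE;
    if (high(0) && high(1) && high(2) && low(3)) return WideGuess::Utf32BE;
    if (high(1) && high(3) && low(0) && low(2))  return WideGuess::Utf16LE;
    if (high(0) && high(2) && low(1) && low(3))  return WideGuess::Utf16BE;
    return WideGuess::None;                   // no clear pattern: not UTF-16/32, or undecidable
}
```

A result of None means the buffer does not look like UTF-16 or UTF-32, so you would continue with the 8-bit fallback. Text made mostly of non-ASCII characters (for example CJK in UTF-16) has few 00 bytes and will not be recognized by this sketch.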

