Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

A problem with the various character encodings is that a file's encoding is not always clearly marked. There are inconsistent conventions for marking some encodings with byte-order marks (BOMs). But in essence you have to be told what the file's encoding is in order to read it accurately.

We build programming tools that read source files, and this gives us grief. We have means to specify defaults, sniff for BOMs, etc., and we do pretty well with conventions and defaults. But one place where we (and, I assume, everybody else) get hung up is UTF-8 files that are not BOM-marked.
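For reference, BOM sniffing itself is simple. The following is a minimal sketch, not the code from our tools; the names `Bom` and `detect_bom` are hypothetical, and it only covers the common Unicode BOMs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical names for illustration; only the common Unicode BOMs are listed.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Report which BOM, if any, the buffer starts with.
Bom detect_bom(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};
    static const unsigned char utf16be[] = {0xFE, 0xFF};

    // UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
    // (A UTF-16 LE file whose first character is U+0000 is ambiguous here.)
    if (len >= 4 && std::memcmp(data, utf32le, 4) == 0) return Bom::Utf32LE;
    if (len >= 4 && std::memcmp(data, utf32be, 4) == 0) return Bom::Utf32BE;
    if (len >= 3 && std::memcmp(data, utf8, 3) == 0)    return Bom::Utf8;
    if (len >= 2 && std::memcmp(data, utf16le, 2) == 0) return Bom::Utf16LE;
    if (len >= 2 && std::memcmp(data, utf16be, 2) == 0) return Bom::Utf16BE;
    return Bom::None;  // no BOM: fall back to defaults or a heuristic
}
```

A result of `Bom::None` is exactly the case the question is about: the file may still be UTF-8, but nothing in its first bytes says so.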

Recent MS IDEs (e.g., Visual Studio 2010) will apparently "sniff" a file to determine whether it is UTF-8 encoded without a BOM. (Being in the tools business, we'd like to be compatible with MS because of their market share, even if it means having to go over the "stupid" cliff with them.) I'm specifically interested in what they use as a heuristic, although discussion of heuristics in general is fine. How can it be "right"? (Consider an ISO 8859-x encoded string interpreted this way.)

EDIT: This paper on detecting character encodings/sets is pretty interesting: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

EDIT December 2012: We ended up scanning the entire file to see whether it contains any violations of UTF-8 sequences; if it does not, we call it UTF-8. The bad part of this solution is that you have to process the characters twice if the file is UTF-8. (If it isn't UTF-8, this test is likely to determine that fairly quickly, unless the file happens to be all 7-bit ASCII, at which point reading it as UTF-8 won't hurt.)
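A whole-buffer scan of this kind might look like the sketch below. This is an illustration under assumptions, not our production code: the function name `is_valid_utf8` is hypothetical, and it takes the file contents as an in-memory buffer with an explicit length. Besides the lead/continuation byte structure, it rejects overlong forms, surrogates, and values above U+10FFFF, as RFC 3629 requires.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if data[0..len) is well-formed UTF-8.
// Pure 7-bit ASCII input trivially passes.
bool is_valid_utf8(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = data[i];
        if (b < 0x80) { ++i; continue; }            // ASCII byte
        std::size_t n;      // number of continuation bytes expected
        unsigned long cp;   // code point being assembled
        unsigned long min;  // smallest code point legal at this length
        if ((b & 0xE0) == 0xC0)      { n = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                          // stray continuation or invalid lead byte
        if (len - i <= n) return false;             // sequence truncated at end of buffer
        for (std::size_t k = 1; k <= n; ++k) {
            unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += n + 1;
    }
    return true;
}
```

An ISO 8859-1 string such as "café" (bytes `63 61 66 E9`) fails immediately, because `E9` looks like a three-byte lead with no continuation bytes after it. That is why the heuristic tends to decide quickly on non-UTF-8 input.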


1 Answer

If the encoding is UTF-8, the first byte you see above 0x7F must be the lead byte of a multi-byte UTF-8 sequence, so test for that. Here is the code we use:

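// Returns the length in bytes (2, 3, or 4) of the UTF-8 sequence starting at
// cpt, or 0 if cpt does not start a structurally valid multi-byte sequence.
// `unc` is presumably a typedef for unsigned char. The buffer is assumed to be
// NUL-terminated, so the continuation-byte checks stop at the terminator.
// Note: only the byte structure is checked; overlong forms, surrogates, and
// values above U+10FFFF are not rejected.
unc ::IsUTF8(unc *cpt)
{
    if (!cpt)
        return 0;

    if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80)
         && ((*(cpt + 3) & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
        if ((*(cpt + 1) & 0xC0) == 0x80)
            return 2;
    }
    return 0; // stray continuation byte, invalid lead byte, or bad continuation
}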

If you get a return of 0, the data is not valid UTF-8. Otherwise, skip the number of bytes returned and continue checking at the next byte above 0x7F.
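The driving loop described above could be sketched as follows. Both names are hypothetical: `utf8_seq_len` is a standalone stand-in that mirrors the structural checks of the function above, and `looks_like_utf8` assumes a NUL-terminated buffer.

```cpp
#include <cassert>

// Hypothetical stand-in for the checker above: length (2, 3, or 4) of the
// multi-byte sequence starting at p, or 0 if the structure is invalid.
static int utf8_seq_len(const unsigned char* p) {
    auto cont = [](unsigned char c) { return (c & 0xC0) == 0x80; };
    // && short-circuits, so nothing past the NUL terminator is ever read.
    if ((*p & 0xF8) == 0xF0) return (cont(p[1]) && cont(p[2]) && cont(p[3])) ? 4 : 0;
    if ((*p & 0xF0) == 0xE0) return (cont(p[1]) && cont(p[2])) ? 3 : 0;
    if ((*p & 0xE0) == 0xC0) return cont(p[1]) ? 2 : 0;
    return 0;
}

// Walk a NUL-terminated buffer: skip ASCII bytes, and require every byte
// above 0x7F to start a structurally valid multi-byte sequence.
bool looks_like_utf8(const unsigned char* s) {
    assert(s != nullptr);
    while (*s) {
        if (*s < 0x80) { ++s; continue; }  // plain ASCII
        int n = utf8_seq_len(s);
        if (n == 0) return false;          // first violation settles it
        s += n;                            // skip the whole sequence
    }
    return true;
}
```

Because this is a structural check only, it will accept some byte sequences that a strict UTF-8 decoder rejects, such as the overlong form `C0 80`.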

