Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Given a string in form of a pointer to a array of bytes (chars), how can I detect the encoding of the string in C/C++ (I used visual studio 2008)?? I did a search but most of samples are done in C#.

Thanks

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
568 views
Welcome To Ask or Share your Answers For Others

1 Answer

Assuming you know the length of the input array, you can make the following guesses:

  1. First, check to see if the first few bytes match any well know byte order marks (BOM) for Unicode. If they do, you're done!
  2. Next, search for '' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive ''s, it's probably UTF-32.
  3. If any character is from 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
  4. At this point it is either: ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen to not use the top bit and do not have any null characters.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...