Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

huh?

The character pictured above was tweeted a few months ago by Mikko Hypp?nen, a computer security expert known for his work on computer viruses and TED talks on computer security. In respect for SO, I will only post an image of it, but you get the idea. It's obviously not something you'd want spreading around your website and freaking out visitors.

Upon further inspection, the character appears to be a letter of the Thai alphabet combined with over 87 diacritics (is there even a limit?!). This got me thinking about security, localization, and how one might handle this sort of input. My searching lead me to this question on Stack, and in turn a blog post from Michael Kaplan on stripping diacritics. In it, he demonstrates how one can decompose a string into its "base" characters (simplified here for the sake of brevity):

StringBuilder sb = new StringBuilder();
foreach (char c in "fa?ade".Normalize(NormalizationForm.FormD))
{
    if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}
Response.Write(sb.ToString()); // facade 

I can see how that this is would be useful in some cases, but in terms of user input, it would be stripping out ALL diacritics. As Kaplan points out, removing the diacritics in some languages can completely change the meaning to the word. This begs the question: How does one permit some diacritics in user input/output, but exclude others extreme cases such as Mikko Hypp?nen's über character?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
254 views
Welcome To Ask or Share your Answers For Others

1 Answer

is there even a limit?!

Not intrinsically in Unicode. There is the concept of a 'Stream-Safe' format in UAX-15 that sets a limit of 30 combiners... Unicode strings in general are not guaranteed to be Stream-Safe, but this could certainly be taken as a sign that Unicode don't intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hak?hmalawaraya? at 1 base plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.

If you only care about common Western European languages you can probably bring that down to 2. So potentially compromise somewhere between those.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...