Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have a string containing a sentence written in Chinese.

It contains Chinese characters plus filler such as spaces, commas, and exclamation marks, all encoded in UTF-8.

With a Latin-1 string, I could use preg_replace with [a-zA-Z] to strip the filler.

How can I keep only the Chinese "alphabet" characters in the string and remove all the filler?


1 Answer

According to this document, here are the Unicode ranges of Chinese characters:

Table 12-2. Blocks Containing Han Ideographs

Block                                     Range         Comment
CJK Unified Ideographs                    4E00–9FFF     Common
CJK Unified Ideographs Extension A        3400–4DBF     Rare
CJK Unified Ideographs Extension B        20000–2A6DF   Rare, historic
CJK Unified Ideographs Extension C        2A700–2B73F   Rare, historic
CJK Unified Ideographs Extension D        2B740–2B81F   Uncommon, some in current use
CJK Compatibility Ideographs              F900–FAFF     Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement   2F800–2FA1F   Unifiable variants

You could use it like this:

preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);

or

preg_replace('/\P{Han}+/u', '', $string);

where \P is the negation of \p
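
As a quick illustration, the two patterns can be compared on a sample sentence. This is a minimal sketch: the function names and the sample string are made up for the example, and the `?? ''` fallback is an assumption about how to handle the null that preg_replace returns on invalid UTF-8.

```php
<?php
// Hypothetical helper: keep only code points in the main
// CJK Unified Ideographs block (U+4E00 to U+9FFF).
function keep_han_by_range(string $text): string
{
    // preg_replace returns null on error (e.g. invalid UTF-8 with /u);
    // falling back to '' is an assumption made for this sketch.
    return preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $text) ?? '';
}

// Hypothetical helper: keep only characters in the Han script.
function keep_han_by_script(string $text): string
{
    // \P{Han} matches any character that is NOT in the Han script.
    return preg_replace('/\P{Han}+/u', '', $text) ?? '';
}

$sentence = '你好, world! 今天天气 very 好。';

echo keep_han_by_range($sentence), PHP_EOL;  // 你好今天天气好
echo keep_han_by_script($sentence), PHP_EOL; // 你好今天天气好
```

Both calls drop the spaces, the Latin words, the ASCII punctuation, and the ideographic full stop, since none of those belong to the Han script or the 4E00–9FFF block.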

See here for all the Unicode scripts.
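
If the rarer blocks from the table also need to be kept, the explicit-range pattern can be extended to list all of them. The following is a sketch under that assumption; the function name is hypothetical, and the ranges are copied from the table as given rather than from the latest Unicode version.

```php
<?php
// Hypothetical helper: keep only characters from the blocks in Table 12-2.
function keep_han_ideographs(string $text): string
{
    // One \x{...}-\x{...} range per block listed in the table.
    $ranges = [
        '\x{4E00}-\x{9FFF}',   // CJK Unified Ideographs
        '\x{3400}-\x{4DBF}',   // Extension A
        '\x{20000}-\x{2A6DF}', // Extension B
        '\x{2A700}-\x{2B73F}', // Extension C
        '\x{2B740}-\x{2B81F}', // Extension D
        '\x{F900}-\x{FAFF}',   // CJK Compatibility Ideographs
        '\x{2F800}-\x{2FA1F}', // CJK Compatibility Ideographs Supplement
    ];

    // Negated character class: remove every run of characters outside the ranges.
    $pattern = '/[^' . implode('', $ranges) . ']+/u';

    // preg_replace returns null on error; '' is a fallback chosen for this sketch.
    return preg_replace($pattern, '', $text) ?? '';
}

// U+20000 is in Extension B, so it survives alongside U+4E2D.
echo keep_han_ideographs("a\u{20000}b中!"), PHP_EOL;
```

The /u modifier is what lets \x{...} refer to code points above U+FFFF, so it has to stay on the pattern.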

