php - Remove all except the chinese characters with regex?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have a string containing a sentence written in Chinese.

It contains Chinese characters plus filler such as spaces, commas, and exclamation marks, all encoded in UTF-8.

With a Latin-1 string, I could use preg_replace with a character class such as [a-zA-Z] to clean it and remove the filler.

How can I keep only the Chinese "alphabet" characters in the Chinese string while removing all the filler?

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:23:36+0000

According to this document, here are the Unicode ranges of Chinese characters:

Table 12-2. Blocks Containing Han Ideographs

Block                                     Range         Comment
CJK Unified Ideographs                    4E00–9FFF     Common
CJK Unified Ideographs Extension A        3400–4DBF     Rare
CJK Unified Ideographs Extension B        20000–2A6DF   Rare, historic
CJK Unified Ideographs Extension C        2A700–2B73F   Rare, historic
CJK Unified Ideographs Extension D        2B740–2B81F   Uncommon, some in current use
CJK Compatibility Ideographs              F900–FAFF     Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement   2F800–2FA1F   Unifiable variants

You could use it like this:

preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);

or

preg_replace('/\P{Han}+/u', '', $string);

where \P is the negation of \p: \p{Han} matches a character in the Han script, and \P{Han} matches any character that is not.
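
As a rough illustration of the \P{Han} approach, here is a minimal sketch. The helper name keepHanOnly and the sample string are assumptions made for this example, not part of the original answer; it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper name, chosen for this example only.
function keepHanOnly(string $text): string
{
    // \P{Han} matches any code point that is NOT in the Han script.
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $result = preg_replace('/\P{Han}+/u', '', $text);

    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
    return $result ?? '';
}

echo keepHanOnly('你好, 世界! Hello 123'), PHP_EOL; // prints 你好世界
```

Without the /u modifier the pattern would be applied byte by byte, so multi-byte characters would not be matched as whole code points.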

See here for all the Unicode scripts.
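
If you prefer explicit ranges over the script property, the blocks from the table above can be combined into one negated character class. This is only a sketch: the helper name keepCjkBlocks is hypothetical, and the ranges are copied from that table, so blocks added to Unicode later are not covered.

```php
<?php
// Hypothetical helper name; the ranges mirror the table in the answer.
function keepCjkBlocks(string $text): string
{
    // Single-quoted strings keep \x{...} literal so PCRE interprets it.
    $blocks = '\x{4E00}-\x{9FFF}'      // CJK Unified Ideographs
            . '\x{3400}-\x{4DBF}'      // Extension A
            . '\x{20000}-\x{2A6DF}'    // Extension B
            . '\x{2A700}-\x{2B73F}'    // Extension C
            . '\x{2B740}-\x{2B81F}'    // Extension D
            . '\x{F900}-\x{FAFF}'      // CJK Compatibility Ideographs
            . '\x{2F800}-\x{2FA1F}';   // CJK Compatibility Ideographs Supplement

    // Remove every run of characters that falls outside the listed blocks.
    $result = preg_replace('/[^' . $blocks . ']+/u', '', $text);

    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
    return $result ?? '';
}

// "\u{20000}" is the first code point of Extension B (PHP 7+ escape syntax).
echo keepCjkBlocks("汉字 \u{20000} abc"), PHP_EOL;
```

Code points above FFFF, such as those in Extension B, only work in the class because of the /u modifier.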
