Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that's not a portable solution as it requires the mbstring extension to be compiled in and enabled. Additionally, it won't tell us which character is invalid.

Is there a regular expression (or any other 100% portable method) that can match invalid UTF-8 bytes in a given string?

That way, those bytes can be replaced if needed while keeping the binary information (for example, when building a test output XML file that includes binary data). Converting the characters to UTF-8 would lose information. So, we may want to convert:

"foo" . chr(128) . chr(255)

Into

"foo<128><255>"

So just detecting that the string is invalid is not good enough; we need to be able to detect which bytes are invalid.


1 Answer

You can use this PCRE regular expression to find invalid UTF-8 in a string. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it works on raw bytes and doesn't rely on PCRE_UTF8 being compiled in.

$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

We can test it by creating a few variations of text:

// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 5-byte sequence (lead byte 0xF8), which is not allowed in UTF-8
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 6-byte sequence (lead byte 0xFC), which is not allowed in UTF-8
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 2-byte lead byte (0xD0) not followed by a continuation byte
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)

etc...
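
To report which bytes are invalid, rather than only whether any exist, the same pattern can be passed to preg_match_all() with the PREG_OFFSET_CAPTURE flag. The following is a minimal sketch and not part of the original answer; the function name findInvalidUtf8Bytes() and the shape of its return value are assumptions chosen for illustration.

```php
<?php
// Byte-level pattern from the answer (no /u modifier, so PCRE sees raw bytes).
$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

// Hypothetical helper: list each invalid match with its byte offset and hex value.
function findInvalidUtf8Bytes(string $text, string $regex): array
{
    $found = [];
    // With PREG_OFFSET_CAPTURE, each entry of $matches[0] is [matchedBytes, byteOffset].
    if (preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$bytes, $offset]) {
            $found[] = ['offset' => $offset, 'hex' => bin2hex($bytes)];
        }
    }
    return $found;
}

foreach (findInvalidUtf8Bytes("foo" . chr(128) . chr(255), $regex) as $hit) {
    echo "offset {$hit['offset']}: 0x{$hit['hex']}\n";
}
// offset 3: 0x80
// offset 4: 0xff
```

Offsets are byte offsets into the original string, so they can be used directly with substr() or the $text[$i] syntax.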

In fact, since this matches the invalid bytes themselves, you can pass it to preg_replace to strip them:

preg_replace($regex, '', $text); // Remove all invalid UTF-8 byte sequences
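
For the case in the question, where invalid bytes should stay visible instead of being dropped, preg_replace_callback() can rewrite each match. This is a minimal sketch rather than part of the original answer; markInvalidUtf8Bytes() is a hypothetical helper name, and the <N> notation is the one the question asks for.

```php
<?php
// Byte-level pattern from the answer (no /u modifier, so PCRE sees raw bytes).
$regex = '/(
    [\xC0-\xC1] # Invalid UTF-8 Bytes
    | [\xF5-\xFF] # Invalid UTF-8 Bytes
    | \xE0[\x80-\x9F] # Overlong encoding of prior code point
    | \xF0[\x80-\x8F] # Overlong encoding of prior code point
    | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
    | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
    | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
    | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
    | (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
    | (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
    | (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
    | (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

// Hypothetical helper: replace every byte of each invalid match with "<decimal>".
function markInvalidUtf8Bytes(string $text, string $regex): string
{
    $result = preg_replace_callback($regex, function (array $m): string {
        $out = '';
        // A match can span more than one byte (e.g. an overlong pair), so walk each byte.
        foreach (str_split($m[0]) as $byte) {
            $out .= '<' . ord($byte) . '>';
        }
        return $out;
    }, $text);

    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed: ' . preg_last_error());
    }
    return $result;
}

echo markInvalidUtf8Bytes("foo" . chr(128) . chr(255), $regex), "\n"; // foo<128><255>
```

Because the decimal value of every replaced byte is kept in the output, the original binary data can still be reconstructed from the marked string, provided the input does not already contain text of the form <N>.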
