I have the following task: build a personal dictionary for Chinese characters. Users choose single Chinese characters from a list. The software then goes through a list of character combinations and filters out all that contain characters which are not in the user's list of single characters. So if the user has studied 1 (一) and 10 (十), then 11 (十一) should be shown, but not 12 (十二).
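To illustrate, here is a minimal PHP sketch of the intended filter (the $userChars and $combinations arrays are made-up sample data):

// Keep only combinations that consist entirely of studied characters.
$userChars = array('一', '十');
$combinations = array('十一', '十二', '一');

// Anchored, Unicode-aware pattern: /^(一|十)+$/u
$pattern = '/^(' . implode('|', $userChars) . ')+$/u';

$known = array();
foreach ($combinations as $word) {
    if (preg_match($pattern, $word)) {
        $known[] = $word;
    }
}
print_r($known); // 十一 and 一 pass; 十二 is dropped because 二 was not studied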

The next issue is scale: there are about 12k single characters and 100k combinations, so the list can become very long. Currently I am facing the following problem: MySQL does not seem to be able to do proper REGEX matching with Unicode characters, while PHP can. When I run the MySQL query below, I get a lot of false positives and have to filter the results again in PHP afterwards, and the whole thing takes a lot of time. With a sample list of 180 single characters matched by the SQL regex below, the query returns over 30,000 combinations and takes about 6 seconds on my machine. After re-checking the results with PHP, only 1,182 combinations remain. That is a lot of false positives, and the re-check takes another couple of seconds. With each single character I add to the list, the time increases by about half a second. A more efficient method is needed urgently.

To tackle the issue, I first need to figure out why MySQL has so many false positives:

If I do regular expressions in PHP, I use /regex/u to indicate that the subject is Unicode, and this gives me correct results.

In MySQL, however, I do not know how to set such a flag. All REGEXP 'regex' results come back as if I had used PHP's preg_match('/regex/', $subject) instead of /regex/u.
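What the byte-oriented side sees can be made visible by dumping the raw UTF-8 bytes of the pattern characters, e.g. with this small diagnostic snippet:

// Each of these characters is a 3-byte UTF-8 sequence. A matcher that
// walks byte by byte instead of character by character can match
// fragments of such sequences across character boundaries.
foreach (array('一', '二', '三', '四', '五') as $char) {
    echo $char . ' => ' . bin2hex($char) . '<br>';
}
// Output: 一 => e4b880, 二 => e4ba8c, 三 => e4b889, 四 => e59b9b, 五 => e4ba94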

I tried changing the collation of the result to various utf8_* variants, but it did not change the outcome. Adding a fulltext index over the table did not do anything either.

Here is a testing function that I wrote to highlight the issue. If you have any other ideas for checks to build in there to drill down on the problem, please tell me.

$db = mysql_connect('localhost', 'kanji', '************');
$link = mysql_select_db('kanji_data', $db);

// Force UTF-8 on every layer of the connection. mysql_set_charset()
// should make the individual SET statements redundant; they are kept
// here to document everything that was tried.
mysql_query('SET NAMES utf8');
mysql_query('SET character_set_client=utf8');
mysql_query('SET character_set_connection=utf8');
mysql_query('SET character_set_results=utf8');
mysql_query('SET collation_connection=utf8_general_ci');
mysql_set_charset('utf8');

echo '<pre>debug: encoding=' . mysql_client_encoding() . '</pre>';

$string = '三|二|四|一|五';
// Note: this SQL pattern is unanchored, so it matches every row that
// merely CONTAINS one of the characters; the PHP re-check below is
// anchored with ^...$ and therefore much stricter.
$sql = "SELECT simplified, length(simplified), searchindex FROM chinese WHERE strlen>0 AND simplified REGEXP '($string)+';";
$sql_encoding = mb_detect_encoding($sql);
echo '<pre>debug: sql string encoding: ' . $sql_encoding . '</pre>';
echo '<pre>debug: sql string: ' . $sql . '</pre>';

$rst = mysql_query($sql);
echo mysql_errno($db) . ': ' . mysql_error($db) . '<br>';

while ($row = mysql_fetch_array($rst, MYSQL_NUM)) {
    $len = mb_strlen($row[0], 'UTF-8');
    $result_encoding = mb_detect_encoding($row[0]);
    // Re-check each row with PCRE in Unicode mode.
    $pattern = '/^(三|二|四|一|五)+$/u';
    if (!preg_match($pattern, $row[0])) {
        echo 'ERROR: ';
    }
    echo 'string: ' . $row[0] . ' (' . $row[1] . ' long mysql, ' . $len . ' long php, encoding: ' . $result_encoding . ') ' . $row[2] . '<br>';
}

The result of the function can be seen on this website.

If I am doing something completely wrong to achieve the required result, I am also happy to tackle this one differently.


1 Answer

The issue at hand is that this version of MySQL cannot do REGEX matching on Unicode characters properly: its regex engine always works byte by byte, while a single UTF-8 character spans several bytes. (MySQL 8.0.4 and later switched to the Unicode-aware ICU regex library, but on older versions there is no real solution, only workarounds.)

One workaround I used was to index every character occurrence in a separate table and then run the checks against that index instead of against the original Japanese text.
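A minimal sketch of that index-table idea, with hypothetical table and column names: store one row per (combination, character) pair, then exclude every combination that contains at least one character outside the user's list.

// Hypothetical schema:
//   CREATE TABLE char_index (
//       combination_id INT NOT NULL,   -- references chinese.id
//       ch CHAR(1) CHARACTER SET utf8 NOT NULL,
//       KEY (ch), KEY (combination_id)
//   );
$userChars = array('一', '十');
$list = "'" . implode("','", array_map('mysql_real_escape_string', $userChars)) . "'";

// A combination qualifies when none of its characters fall outside the
// studied set, so exclude everything that has at least one foreign char.
$sql = "SELECT c.simplified
        FROM chinese c
        WHERE NOT EXISTS (
            SELECT 1 FROM char_index i
            WHERE i.combination_id = c.id
              AND i.ch NOT IN ($list)
        )";
// Plain equality and IN() respect the column's character set, so no
// byte-level regex is involved anywhere.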

