Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);

It works fine.

However, it doesn't split into sentences if there are unicode characters:

$sentences = 'Entertainment media properties.? Fairy Tail and Tokyo Ghoul.';

Or this scenario:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";

What can I do to make it work when unicode characters exist in the text?

Here is an ideone for testing.

Bounty info

I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with Wiktor Stribiżew for more relevant info on this issue.


1 Answer

As should be expected, any sort of natural language processing is a non-trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and decided which ideas were good and which were not, and every rule has 20-40% exceptions. With that said, a single regex that could do your bidding would be off the charts in complexity. Still, the following solution relies mainly on regexes.


  • The idea is to gradually go over the text.
  • At any given time, the current chunk of the text is held in two parts: before, the candidate for the substring preceding a sentence boundary, and after, the candidate for the substring following it.
  • The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
  • If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.
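
The pair check described in the list above can be sketched roughly as follows. This is a reduced illustration, not the full rule set: the helper name classify_position is hypothetical, and only two simplified pairs are shown (one "looks like a boundary but isn't" rule for "etc." and one generic boundary rule). The first pair that matches decides the outcome.

```php
<?php
// Hypothetical helper (not part of the answer's code): classify the position
// between $before and $after using a reduced table of regex pairs.
function classify_position(string $before, string $after): string {
    $pairs = [
        // [before regex (anchored at the end), after regex (anchored at the start), is boundary?]
        ['/\b[Ee]tc\.\s\Z/u', '/\A[^\p{Lu}]/u', false], // "etc. " followed by a non-uppercase char
        ['/[.!?]\s\Z/u',      '/\A/u',          true],  // end punctuation plus whitespace
    ];
    foreach ($pairs as [$beforeRe, $afterRe, $isBoundary]) {
        // Both halves of a pair must match for the rule to apply.
        if (preg_match($beforeRe, $before) && preg_match($afterRe, $after)) {
            return $isBoundary ? 'boundary' : 'not a boundary';
        }
    }
    return 'no match';
}

echo classify_position('cats, dogs, etc. ', 'and birds'), "\n"; // not a boundary
echo classify_position('It works fine. ', 'However'), "\n";     // boundary
echo classify_position('It works', ' fine'), "\n";              // no match
```

Ordering matters here: because the exception rules come first, "etc. " followed by a lowercase letter never reaches the generic boundary rule.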

As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

As far as accuracy goes, I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance, the regexes should be highly performant: all of them have either a \A or a \Z anchor, there are almost no repetition quantifiers, and where there are, there can't be any backtracking. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.
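
A rough way to start that benchmarking is to time one end-anchored pattern on a short window, as in the sketch below. The pattern, the sample string and the iteration count are arbitrary choices for illustration, and hrtime() assumes PHP 7.3 or later. It may also be worth measuring how the cost changes as the subject string gets longer, since an end anchor does not by itself stop the engine from trying earlier start positions.

```php
<?php
// Hypothetical micro-benchmark: time one end-anchored pattern on a short window.
$pattern    = '/\b[Ee]tc\.\s\Z/su';
$window     = 'cats, dogs, etc. ';
$iterations = 100000;

$start = hrtime(true);                      // monotonic clock, in nanoseconds
for ($i = 0; $i < $iterations; $i++) {
    preg_match($pattern, $window);
}
$elapsedMs = (hrtime(true) - $start) / 1e6; // nanoseconds to milliseconds

printf("%d matches in %.2f ms\n", $iterations, $elapsedMs);
```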


Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.


function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties.? Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
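
One caveat that may matter for non-ASCII input: substr() and $text[0] work on bytes, so the sliding window in sentence_split can cut a multibyte UTF-8 character in half, and a pattern with the u modifier does not match a malformed subject at all (preg_match returns false). The sketch below only demonstrates that difference; it assumes the mbstring extension is loaded, and whether switching the function to mb_substr is the right fix for your data is something to test rather than take as given.

```php
<?php
// Assumes the mbstring extension is available.
$text = "\u{00DC}nicode";                            // "Ünicode"; "Ü" takes two bytes in UTF-8

$byteWindow = substr($text, 0, 1);                   // first byte only: malformed UTF-8
$charWindow = mb_substr($text, 0, 1, 'UTF-8');       // the whole character "Ü"

var_dump(preg_match('/\A\p{Lu}/u', $byteWindow));    // bool(false): the subject is rejected
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
var_dump(preg_match('/\A\p{Lu}/u', $charWindow));    // int(1)
```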
