Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

While refining this answer, I found that the regex below does not behave in R the way its syntax suggests it should.

 +?on.*$

According to my understanding of regex, the above regex matches:

one or more spaces (matched lazily), followed by on, followed by any characters (except a newline) up to the end.

INPUT:

Posted by ondrej on 29 Feb 2020.
Posted by ona'je on 29 Feb 2020.

OUTPUT (what I expect when the text matched by the regex above is replaced with ""):

Posted by
Posted by 

When I test it in Python (implementation here), JavaScript, and Java (implementation here), I get the result I expected.

const myString = "Posted by ondrej on 29 Feb 2020.\nPosted by ona'je on";

// "g" replaces every match; "m" lets $ match at the end of each line.
console.log(myString.replace(new RegExp(" +?on.*$", "gm"), ""));
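As a rough cross-check of the reading above, the sketch below lists exactly which span the pattern consumes on each line in JavaScript (a backtracking engine, not R's TRE). It uses the two INPUT lines from the question; the variable names are illustrative assumptions, not part of the original post.

```javascript
// Illustrative sketch: show what " +?on.*$" matches per line in JavaScript.
// Variable names are hypothetical; the text is the INPUT from the question.
const sampleText =
  "Posted by ondrej on 29 Feb 2020.\nPosted by ona'je on 29 Feb 2020.";

// "g" finds every match; "m" lets $ match at the end of each line.
const linePattern = / +?on.*$/gm;

// Collect the matched substring for each line.
const matchedSpans = Array.from(sampleText.matchAll(linePattern), (m) => m[0]);
console.log(matchedSpans);

// Removing those spans leaves only the prefix of each line.
const stripped = sampleText.replace(linePattern, "");
console.log(stripped);
```

Here the match starts at the first space that is followed by on and, because .* is greedy, runs to the end of the line, which is the behavior the question expects from R as well.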


1 Answer

It looks like the TRE regex engine (used by default in base R regex functions), which is based on the regex library originally written by Henry Spencer in 1986, returns the shortest match at the end of the string when the first quantified pattern in the regular expression is lazy and the expression ends with the $ anchor.

Compare these cases:

sub(" +?on.*$", "", Data)  # "Posted by ondrej" "Posted by ona'je"
sub(" +?on.*", "", Data)   # "Posted bydrej on 29 Feb 2020." "Posted bya'je on 29feb 2020"
sub(" +?on(.*)", "", Data) # as expected
sub(" +on.*", "", Data)    # as expected

What is going on?

  • In the first case, sub(" +?on.*$", "", Data), the first quantified pattern sets the greediness of all quantifiers on the same level of the regex. Because the leading space is quantified with the lazy +?, the second quantifier, *, is also treated as lazy even though it has no ? after it. This is a known TRE "bug", also present in some other regex engines based on Henry Spencer's regex library.

  • The second case, sub(" +?on.*", "", Data), matches as if it were written " +?on.*?" (again, because the first pattern sets the greediness of that level to lazy). It therefore matches only one or more spaces followed by on, since .*? matches nothing when it is at the end of the pattern.

  • The third case, sub(" +?on(.*)", "", Data), yields the expected result because the second quantified pattern, .*, sits on a different level (one level deeper, inside the group), so its greediness is not affected by the +? on the outer level. (.*) therefore matches greedily.

  • The fourth case, sub(" +on.*", "", Data), yields the expected result because the first quantified pattern is greedy, so the following quantified pattern is greedy as well.
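The TRE behavior itself cannot be reproduced outside R, but the claim about the second case can be approximated in a backtracking engine by writing the laziness explicitly. The following JavaScript sketch assumes the first input line from the question; the variable names are hypothetical, and it only mimics the described effect rather than running TRE.

```javascript
// Illustrative sketch (JavaScript, not TRE): variable names are hypothetical.
const inputLine = "Posted by ondrej on 29 Feb 2020.";

// Explicitly lazy ".*?" at the end of the pattern matches the empty string,
// so only " on" is removed. This mirrors what the answer reports for
// sub(" +?on.*", "", Data) under TRE.
const forcedLazy = inputLine.replace(/ +?on.*?/, "");

// In JavaScript each quantifier keeps its own greediness, so all three
// patterns below remove everything from the first " on" onward.
const lazyFirst = inputLine.replace(/ +?on.*/, "");
const grouped = inputLine.replace(/ +?on(.*)/, "");
const greedyFirst = inputLine.replace(/ +on.*/, "");

console.log(forcedLazy);
console.log(lazyFirst, "|", grouped, "|", greedyFirst);
```

The difference between forcedLazy and lazyFirst is what TRE erases: in R's default engine the lazy +? appears to make the unmarked .* behave like .*?, whereas JavaScript treats the two patterns differently.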

