Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

First, please note that I am interested in how something like this would work. I am not intending to build it for a client, as I'm sure there are already open source implementations.

How do the algorithms that detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out common words like 'the', 'a', etc. and then see how many words are the same in different essays? Do they then have a magic number of identical words which flags an essay as a possible duplicate? Do they use levenshtein()?

My language of choice is PHP.

UPDATE

I'm thinking of checking for plagiarism not globally but within, say, 30 uploaded essays from one class, in case students have worked together on a strictly one-person assignment.

Here is an online site that claims to do so: http://www.plagiarism.org/

1 Answer

Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).

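However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text's Kolmogorov complexity, but you can approximate it by simply compressing the text. With C(s) as the compressed length of s and xy as the concatenation of the two texts, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).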

A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:

PHP:

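function ncd($x, $y) {
  // Compressed lengths stand in for the (uncomputable) Kolmogorov complexity.
  $cx = strlen(gzcompress($x));
  $cy = strlen(gzcompress($y));
  return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}

// Lower scores mean more similar texts; PHP_EOL keeps the two results on separate lines.
print(ncd('this is a test', 'this was a test') . PHP_EOL);
print(ncd('this is a test', 'this text is completely different') . PHP_EOL);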

Python:

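>>> from zlib import compress as c
>>> def ncd(x, y):
...     # zlib.compress() needs bytes in Python 3, so encode the str inputs first
...     x, y = x.encode('utf-8'), y.encode('utf-8')
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...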
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361
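
To get a feel for how much the choice of compressor matters, the same formula can be run with the other compressors in Python's standard library. This is only a rough sketch, not a benchmark: the helper name ncd_with and the sample texts are made up for illustration, and the inputs are repeated simply so that each compressor has enough data to work with.

```python
import bz2
import lzma
import zlib


def ncd_with(compress, x, y):
    """NCD of two byte strings, using any compress(bytes) -> bytes function."""
    cx, cy = len(compress(x)), len(compress(y))
    return (len(compress(x + y)) - min(cx, cy)) / max(cx, cy)


# Hypothetical sample texts: a source, a lightly reworded copy, an unrelated text.
original = b"The quick brown fox jumps over the lazy dog. " * 20
reworded = b"The quick brown fox leaps over the lazy dog. " * 20
unrelated = b"Compression needs no knowledge of grammar or vocabulary. " * 20

compressors = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

# For each compressor: (score for the near-copy, score for the unrelated text).
scores = {}
for name, compress in compressors.items():
    near = ncd_with(compress, original, reworded)
    far = ncd_with(compress, original, unrelated)
    scores[name] = (near, far)
    print(f"{name}: reworded={near:.3f} unrelated={far:.3f}")
```

What matters is the gap between the two numbers for each compressor, since the absolute values are not comparable across compressors because of their different fixed overheads.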

Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
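
For the scenario in the question's update (around 30 essays from one class), one way to apply this is to score every pair of essays and look at the closest pairs by hand; 30 essays give 435 pairs, which is cheap to compute. A minimal sketch, assuming the essays are already loaded as strings: the function name rank_pairs, the student names and the short sample essays are invented for illustration, and any cutoff score would have to be calibrated on real submissions.

```python
from itertools import combinations
from zlib import compress


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, same formula as above."""
    cx, cy = len(compress(x)), len(compress(y))
    return (len(compress(x + y)) - min(cx, cy)) / max(cx, cy)


def rank_pairs(essays):
    """Return (score, name_a, name_b) for every pair, most similar first."""
    encoded = {name: text.encode("utf-8") for name, text in essays.items()}
    scored = [
        (ncd(encoded[a], encoded[b]), a, b)
        for a, b in combinations(sorted(encoded), 2)
    ]
    return sorted(scored)


# Hypothetical submissions: bob's essay is a lightly edited copy of alice's.
essays = {
    "alice": (
        "The industrial revolution changed how people lived and worked. "
        "Factories drew families from farms into crowded cities, where wages "
        "were low and hours were long. Over time, new laws improved safety "
        "and limited child labour."
    ),
    "bob": (
        "The industrial revolution changed how people lived and worked. "
        "Factories pulled families from farms into crowded towns, where wages "
        "were low and hours were long. Over time, new laws improved safety "
        "and restricted child labour."
    ),
    "carol": (
        "Photosynthesis lets green plants turn sunlight, water and carbon "
        "dioxide into sugar. Chlorophyll in the leaves absorbs mostly red and "
        "blue light, which is why most foliage looks green to us."
    ),
}

for score, a, b in rank_pairs(essays):
    print(f"{score:.3f}  {a} / {b}")
```

The pair at the top of the list is the one to read first. A low score is a reason to look closer, not proof of copying, since essays on the same prompt will naturally share vocabulary.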