string - Algorithm to find articles with similar text

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I have many articles in a database (each with a title and text). I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" when you ask a question.

I tried googling for this but only found pages about other "similar text" issues, such as comparing every article with all the others and storing the similarity scores somewhere. Stack Overflow does this in "real time" on text that I have just typed.

How?

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:08:45+0000

Edit distance isn't a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you'd actually be interested in searching.
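To make the word-order point concrete, here is a minimal sketch of the classic dynamic-programming (Levenshtein) edit distance. The function name `edit_distance` is just an illustrative choice. Each comparison costs time proportional to the product of the two string lengths, and finding related articles this way would mean running it against every stored article.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


# Same words, different order: the distance is large even though the
# two phrases are about the same thing.
reordered = edit_distance("similar text", "text similar")
# One extra letter: the distance is small.
near_copy = edit_distance("similar text", "similar texts")
print(reordered, near_copy)
```

Two articles on the same topic rarely share long character sequences in the same order, so a measure like this tends to rate them as very different.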

Something like Lucene is the way to go. You index all your documents, and then when you want to find documents similar to a given document, you turn that document into a query and search the index. Internally, Lucene uses tf-idf and an inverted index, so the whole process takes time proportional to the number of documents that could possibly match, not the total number of documents in the collection.
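The idea can be sketched in a few lines of plain Python. This is not Lucene's API or its actual scoring formula; it is a toy illustration under simplifying assumptions (regex tokenization, a smoothed idf, an unnormalized dot-product score), and the names `TinyIndex`, `add`, and `similar` are hypothetical. It shows why only documents sharing at least one term with the query are ever touched.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index with tf-idf scoring (illustration only)."""

    def __init__(self):
        # term -> {doc_id: number of times the term occurs in that doc}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count
        self.doc_count += 1

    def idf(self, term):
        # Smoothed inverse document frequency: rare terms weigh more.
        df = len(self.postings.get(term, ()))
        return math.log((1 + self.doc_count) / (1 + df)) + 1.0

    def similar(self, text, top_n=5, exclude=None):
        """Treat `text` as a query and return the top_n (doc_id, score) pairs."""
        scores = defaultdict(float)
        for term, q_count in Counter(tokenize(text)).items():
            weight = self.idf(term)
            # Only documents that contain this term are visited.
            for doc_id, d_count in self.postings.get(term, {}).items():
                if doc_id != exclude:
                    scores[doc_id] += (q_count * weight) * (d_count * weight)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_n]


index = TinyIndex()
index.add("a", "Computing text similarity between articles")
index.add("b", "Baking sourdough bread at home")
index.add("c", "Finding similar articles with tf-idf")

# The text the user just typed becomes the query.
for doc_id, score in index.similar("algorithm to find similar articles", top_n=2):
    print(doc_id, round(score, 3))
```

Because the score is not length-normalized, long documents are favored here; a real engine also handles stemming, stop words, and normalization, which is a good reason to use an existing library rather than extend this sketch.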
