c# - Text indexing algorithm

Question

Welcome To Ask or Share your Answers For Others

c# - Text indexing algorithm

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am writing a C# winform application for an archiving system. The system has a huge database where some tables would have more than 1.5 million records. What i need is an algorithm that indexes the content of these records. Mainly, the files are Microsoft office, PDF and TXT documents. anyone can help? whether with ideas, links, books or codes, I appreciate it :)

example: if i search for the word "international" in a certain folder in the database, i get all the files that contain that word ordered by a certain criteria such as relevance, modifying date...etc

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

230 views

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:36:21+0000

You need to create, what is known as an inverted index - which is at the core of how search engines work (a la Google). Apache Lucene is arguably the best library for inverted indexing. You have 2 options:

Lucene.net - a .NET port of the Java Lucene library.
Apache Solr - a full-fledged search server built using Lucene libs and easily integrable into your .NET application because it has a RESTful API. Comes out-of-the-box with several features such as caching, scaling, spell-checking, etc. You can make life easier for your app-to-Solr interaction using the excellent SolrNet library.
Apache Tika offers a very extensive data/metadata extraction toolkit working with PDFs, HTMLs, MS Office docs etc. A simpler option would be to the IFilter API. See this article for more details.

Categories

c# - Text indexing algorithm

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags