In our application, we need to search inside the content of our blobs. I have already looked at Azure Cognitive Search, but the maximum blob size it can index is 256 MB, and we have blobs larger than that. I searched for other alternatives that support indexing and searching such huge blobs, but couldn't find any. Is there something we can use? Thanks


1 Answer

Typically, in cases where you have blobs this large, I think it is best to pre-process them. This also has the advantage of staging the content in case you ever need to geo-replicate it or quickly restore from a backup. For example, Azure Functions has blob triggers that fire your code whenever a blob is added or changed. In such a function, you could leverage Apache Tika to extract the text from the files and store it in a separate blob container, then have Cognitive Search pick up the extracted text from there. Please note that extracting this much text from files this large can be quite compute- and memory-intensive, so your pre-processing might actually need a higher compute/memory tier. A sketch of such a function follows below.
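As a rough, minimal sketch of that pre-processing step (not the original poster's code): it assumes the TikaOnDotNet NuGet package and hypothetical container names raw-docs and extracted-text, which are placeholders you would swap for your own.

```csharp
using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using TikaOnDotNet.TextExtraction;

public static class ExtractText
{
    // Fires whenever a blob lands in the (hypothetical) "raw-docs" container,
    // extracts its text with Apache Tika, and writes the plain text to a
    // separate "extracted-text" container for Cognitive Search to index.
    [FunctionName("ExtractText")]
    public static void Run(
        [BlobTrigger("raw-docs/{name}")] Stream input,
        [Blob("extracted-text/{name}.txt", FileAccess.Write)] Stream output,
        string name,
        ILogger log)
    {
        // Buffering the whole blob in memory keeps the sketch simple, but it
        // is exactly the memory cost noted above for multi-hundred-MB files.
        using (var buffer = new MemoryStream())
        {
            input.CopyTo(buffer);
            var result = new TextExtractor().Extract(buffer.ToArray());
            using (var writer = new StreamWriter(output))
            {
                writer.Write(result.Text);
            }
            log.LogInformation($"Extracted {result.Text.Length} characters from {name}");
        }
    }
}
```

For very large files you would want to run this on a Functions plan with enough memory, e.g. a Premium or Dedicated plan rather than the Consumption plan.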

The code is a little older now, but this example of using TikaDotNet in an Azure Function might also help: https://github.com/liamca/AzureSearch-AzureFunctions-CognitiveServices/blob/master/ApacheTika/run.csx

Please note, though, that I have never tried this code on a file that large.
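Once the plain-text blobs are in place, the remaining step is pointing a Cognitive Search blob indexer at the output container. Here is a minimal sketch using the Azure.Search.Documents SDK, assuming an existing index named docs-index and the same hypothetical extracted-text container (service URL, keys, and all names are placeholders):

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

class Program
{
    static void Main()
    {
        var indexerClient = new SearchIndexerClient(
            new Uri("https://<your-service>.search.windows.net"),
            new AzureKeyCredential("<admin-key>"));

        // Data source pointing at the container the Function writes to.
        var dataSource = new SearchIndexerDataSourceConnection(
            "extracted-text-ds",
            SearchIndexerDataSourceType.AzureBlob,
            "<storage-connection-string>",
            new SearchIndexerDataContainer("extracted-text"));
        indexerClient.CreateOrUpdateDataSourceConnection(dataSource);

        // Indexer that feeds the plain-text blobs into the existing index.
        var indexer = new SearchIndexer(
            name: "extracted-text-indexer",
            dataSourceName: dataSource.Name,
            targetIndexName: "docs-index");
        indexerClient.CreateOrUpdateIndexer(indexer);
    }
}
```

Because the indexer only ever sees the extracted plain text, the original blobs' 256 MB limit no longer applies to them.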

