Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I have a directory containing the user files for one of my programs, around 70k JSON files in total.

The current search method uses glob and foreach. It's getting quite slow and hogging the server. Is there a more efficient way to search through these files? I'm running this on an Ubuntu 16.04 machine, and I can use exec if needed.

Update:

These are JSON files, and each one has to be opened to check whether it contains the search query. Looping over the files is quite fast, but opening each one takes quite a while.

These cannot be indexed using SQL or memcached, as I'm using memcached for some other things.

1 Answer

As you implied yourself, to make this the most performant search possible, you need to hand over the task to a tool that is designed for this purpose.

I'd say go beyond grep and look at ack first. Then look at ag, and finally settle on ripgrep, as it's the best of its kind.
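For the case in the question (finding which JSON files contain a search string), a minimal sketch could look like the following. The function name `search_json_files` is hypothetical, and the fallback to grep is my own assumption for machines where ripgrep is not installed. The query is treated as a literal string, not a regex.

```shell
# Print the paths of the JSON files under a directory that contain a literal string.
# search_json_files is a hypothetical helper name, not part of ripgrep or grep.
# Usage: search_json_files QUERY DIRECTORY
search_json_files() {
  local query=$1 dir=$2
  if command -v rg >/dev/null 2>&1; then
    # -l: print only the names of matching files
    # -F: treat the query as a fixed string, not a regular expression
    # -g '*.json': only search files whose name matches the glob
    rg -l -F -g '*.json' -e "$query" "$dir"
  else
    # Fallback (assumption): GNU grep, recursive, same literal matching
    grep -rlF --include='*.json' -e "$query" "$dir"
  fi
}
```

Since you mentioned exec is available, the same command could be run from PHP through exec, with the query passed through escapeshellarg first. Note that ripgrep searches in parallel, so the order of the returned paths is not guaranteed; sort the output if you need it stable.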


Experiment

I ran a small experiment with ack on a low-spec laptop, searching for an existing class name across 19,501 files. Here are the results:

$ cd ~/Dev/php/packages
$ ack -f | wc -l 
19501

$ time ack PHPUnitSeleniumTestCase | wc -l
10
ack PHPUnitSeleniumTestCase  7.68s user 2.99s system 21% cpu 48.832 total
wc -l  0.00s user 0.00s system 0% cpu 48.822 total

I did the same experiment, this time with ag. And it really surprised me:

$ time ag PHPUnitSeleniumTestCase | wc -l
10
ag PHPUnitSeleniumTestCase  0.24s user 0.98s system 13% cpu 9.379 total
wc -l  0.00s user 0.00s system 0% cpu 9.378 total

I was so excited by the results that I went on and tried ripgrep as well. Even better:

$ time rg PHPUnitSeleniumTestCase | wc -l
10
rg PHPUnitSeleniumTestCase  0.44s user 0.27s system 19% cpu 3.559 total
wc -l  0.00s user 0.00s system 0% cpu 3.558 total

Experiment with this family of tools and see which one best suits your needs.


P.S. ripgrep's original author has left a comment under this post, saying that ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}. Interesting read, fabulous work.

