Suppose, for illustrative purposes, you are running a library using a simple MySQL "books" table with three columns:
(id, title, status)
- id is the primary key
- title is the title of the book
- status could be an enum describing the book's current state (e.g. AVAILABLE, CHECKEDOUT, PROCESSING, MISSING)
A simple query to report how many books fall into each state is:
SELECT status, COUNT(*) FROM books GROUP BY status
or to specifically find how many books are available:
SELECT COUNT(*) FROM books WHERE status = "AVAILABLE"
However, once the table grows to millions of rows, these queries take several seconds to complete. Adding an index to the "status" column doesn't appear to make a difference in my experience.
Aside from periodically caching the results or explicitly updating summary info in a separate table each time a book changes state (via triggers or some other mechanism), are there any techniques for speeding up these kinds of queries? It seems that the COUNT queries end up looking at every row, and (without knowing more details) I'm a bit surprised that this information can't somehow be determined from the index.
UPDATE
Using the sample table (with an indexed "status" column) with 2 million rows, I benchmarked the GROUP BY query. Using the InnoDB storage engine, the query takes 3.0 - 3.2 seconds on my machine. Using MyISAM, the query takes 0.9 - 1.1 seconds. There was no significant difference between count(*), count(status), or count(1) in either case.
MyISAM is admittedly a bit faster, but I was curious to see if there was a way to make an equivalent query run much faster (e.g. 10-50 ms -- fast enough to be called on every webpage request for a low-traffic site) without the mental overhead of caching and triggers. It sounds like the answer is "there's no way to run the direct query quickly" which is what I expected - I just wanted to make sure I wasn't missing an easy alternative.
Question&Answers:os