With larger datasets, window functions are the most efficient way to perform these kinds of queries: the table is scanned only once, instead of once for each date as a self-join would do. It also looks a lot simpler. :) PostgreSQL 8.4 and later support window functions.
This is what it looks like:
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM subscriptions
GROUP BY created_at;
Here OVER creates the window; ORDER BY created_at means that it has to sum up the counts in created_at order.
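To see the running total on a small scale, here is a minimal sketch with a hypothetical five-row sample inlined as a CTE (the rows and the daily_count and running_total aliases are my own, not from the question). When a window has an ORDER BY and no explicit frame, the default frame runs from the start of the partition up to the current row, which is what turns sum() into a cumulative sum:

```sql
-- Hypothetical sample data, inlined so the query runs on its own.
WITH subscriptions (created_at, email) AS (
    VALUES (date '2000-04-04', 'a@example.com'),
           (date '2000-04-04', 'b@example.com'),
           (date '2000-04-05', 'c@example.com'),
           (date '2000-04-06', 'd@example.com'),
           (date '2000-04-06', 'e@example.com')
)
SELECT created_at,
       count(email)                                 AS daily_count,   -- per-day aggregate
       sum(count(email)) OVER (ORDER BY created_at) AS running_total  -- window over the grouped rows
FROM subscriptions
GROUP BY created_at
ORDER BY created_at;
-- daily_count:   2, 1, 2
-- running_total: 2, 3, 5
```

The inner count(email) is evaluated per group first; the window function then runs over the grouped rows, one row per date.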
Edit: If you want to remove duplicate emails within a single day, you can use sum(count(distinct email)). Unfortunately this won't remove duplicates that cross different dates.
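A small illustration of that limitation, again with hypothetical inlined rows: the address a@example.com appears twice on the first day and once more on the second day. The same-day duplicate is collapsed, but the cross-day one is still counted:

```sql
-- Hypothetical sample data: one same-day and one cross-day duplicate.
WITH subscriptions (created_at, email) AS (
    VALUES (date '2000-04-04', 'a@example.com'),
           (date '2000-04-04', 'a@example.com'),  -- same-day duplicate, removed by DISTINCT
           (date '2000-04-05', 'a@example.com'),  -- cross-day duplicate, still counted
           (date '2000-04-05', 'b@example.com')
)
SELECT created_at,
       sum(count(DISTINCT email)) OVER (ORDER BY created_at) AS running_total
FROM subscriptions
GROUP BY created_at
ORDER BY created_at;
-- running_total: 1, 3   (although there are only 2 distinct emails overall)
```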
If you want to remove all duplicates, I think the easiest is to use a subquery and DISTINCT ON. This will attribute emails to their earliest date (because I'm sorting by created_at in ascending order, it'll choose the earliest one):
SELECT created_at, sum(count(email)) OVER (ORDER BY created_at)
FROM (
SELECT DISTINCT ON (email) created_at, email
FROM subscriptions ORDER BY email, created_at
) AS subq
GROUP BY created_at;
If you create an index on (email, created_at), this query shouldn't be too slow either.
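For comparison, DISTINCT ON is a PostgreSQL extension; the same earliest-date attribution can also be sketched with a plain GROUP BY and min(). This is an alternative formulation, not the query above, and I have not benchmarked it against the DISTINCT ON version. The inlined rows and the first_seen name are hypothetical:

```sql
-- Hypothetical sample data with same-day and cross-day duplicates.
WITH subscriptions (created_at, email) AS (
    VALUES (date '2000-04-04', 'a@example.com'),
           (date '2000-04-04', 'a@example.com'),
           (date '2000-04-05', 'a@example.com'),
           (date '2000-04-05', 'b@example.com')
),
first_seen AS (
    -- One row per email, attributed to its earliest date.
    SELECT email, min(created_at) AS created_at
    FROM subscriptions
    GROUP BY email
)
SELECT created_at,
       sum(count(email)) OVER (ORDER BY created_at) AS running_total
FROM first_seen
GROUP BY created_at
ORDER BY created_at;
-- running_total: 1, 2
```

Since each email now occurs exactly once, the last running_total equals the number of distinct emails in the table.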
(If you want to test, this is how I created the sample dataset:)
create table subscriptions as
select date '2000-04-04' + (i/10000)::int as created_at,
'foofoobar@foobar.com' || (i%700000)::text as email
from generate_series(1,1000000) i;
create index on subscriptions (email, created_at);