I highly doubt I'm doing this in the most efficient manner, which is why I tagged plpgsql
on here. I need to run this on 2 billion rows for a thousand measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity for spurts often but sometimes for a long time. You need to aggregate but when you do so, you need to look at how long it was repeating and make various filters based on that information. Say you are measuring mpg on a car but it's stuck at 20 mpg for an hour than moves around to 20.1 and so on. You'll want to evaluate the accuracy when it's stuck. You could also place some alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado:
--here's my data, you have different systems, the time of measurement, and the actual measurement
--as well, the raw data has whether or not it's a repeat (hense the included window function
select * into temporary table cumulative_repeat_calculator_data
FROM
(
select
system_measured, time_of_measurement, measurement,
case when
measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
then 1 else 0 end as repeat
FROM
(
SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
UNION
SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
UNION
SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
) as data
) as data;
--unfortunately you can't have window functions within window functions, so I had to break it down into subquery
--what we need is something to partion on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
select
*,
sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
from cumulative_repeat_calculator_data
order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example
select *,
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement
So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking plpgsql because I suspect this needs to done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to sub-queries?
I have tested one alternative method, but it still relies on a sub-query and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join to the larger table and if the timestamp is between those, you classify it as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system
. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of 'events'. Do you think that's a better way to go?