Filling Missing Data and Gaps by Generating a Continuous Series in SQL
Often your business processes will store data in a database in a sparse format, i.e., if no data exists for a given dimension, no row will exist. But for reporting purposes, you will still want to show a zero next to a particular dimension even though no data exists in the database. It is important to have a consistent number of data points for period-to-period comparisons, and other analytical window functions like moving averages and running sums.
Product Sales from the database
We want to find out the total sum of products sold on a day-by-day basis. A simple query like this will give us that:
select dt, sum(sales) as sales from transactions
group by dt
order by dt;
dt | sales |
---|---|
2016-01-05 | 27 |
2016-01-06 | 4 |
2016-01-07 | 28 |
2016-01-08 | 25 |
... | ... |
The problem with this query is that for certain dates we have zero sales. By omitting that data from the result it appears as if we've sold products every single day. That's misleading.
Solution: Use generate_series
with a LEFT JOIN
PostgreSQL has an function called generate_series
that returns a continuous series as multiple rows. For the purposes of this example, we'll generate a continuous series of dates in the month of Jan 2016:
select generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval)
generate_series |
---|
2016-01-01 |
2016-01-02 |
2016-01-03 |
2016-01-04 |
... |
Now that we have a continuous series of rows, we'll use a LEFT JOIN
to generate another table with the gaps plugged:
select series as dt,
sum(sales) as sales
from generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval) as series
left join transactions on transactions.dt = series
dt | sales |
---|---|
2016-01-01 | Ø |
2016-01-02 | Ø |
2016-01-03 | Ø |
2016-01-04 | Ø |
2016-01-05 | 27 |
2016-01-06 | 4 |
2016-01-07 | 28 |
2016-01-08 | 25 |
... | ... |
As you can see, we now have data for every day of the month, even though the missing data has been filled with a NULL
(Ø). We'll use the COALESCE
function convert these null values to zero.
select series as dt,
coalesce(sum(sales), 0) as sales
from generate_series('2016-01-01'::date, '2016-01-31'::date, '1 day'::interval) as series
left join transactions on transactions.dt = series
Our final result can be visualized as:
Notes
- You can use the generate series function and left joins to fill gaps along any dimension. The time dimension is the most common because it's the most freqeuntly used dimensions for comparisons.
- Generate series for MySQL
- Generate series for Redshift
No spam, ever! Unsubscribe any time. See past emails here.