Time series are everywhere: your web event logs, temperature readings, startup venture capital financings and your customer acquisition data. Squint hard enough at any dataset and you'll find a time axis. Here are four tools for working with time series data in PostgreSQL and Redshift…
An example of a time series
A time series dataset has a simple structure; for example, your rows might look something like this:
Timestamp | User | Event | Event Properties |
---|---|---|---|
2016-01-01 | user_3dfc123 | Signup | {} |
2016-01-02 | user_3dfc123 | Confirm Email | {} |
2016-01-03 | … | … | … |
This is an example of a normalized table: you'll have to join it with the users table to look up user properties like name, location, etc.
The table is structured around a user, the event name and the properties related to that event. And of course, being a time series, there's a timestamp of when the event was generated.
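A schema for such an events table might look like this (a hypothetical sketch; the table and column names are illustrative, not from the original article):

```sql
CREATE TABLE events (
    timestamp        timestamptz NOT NULL,  -- when the event was generated
    user_id          text        NOT NULL,  -- joins to the users table
    event_name       text        NOT NULL,  -- e.g. 'Signup', 'Confirm Email'
    event_properties jsonb       NOT NULL DEFAULT '{}'
);
```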
Tool 1: Use date_trunc to group and aggregate by custom periods
The date_trunc function truncates a timestamp to a specified precision. Postgres supports a wide variety of precisions: second, minute, hour, day, week, month, quarter, year, etc.
Examples:
> SELECT date_trunc('hour', '2016-01-01 17:27:15'::timestamp);
date_trunc
---------------------
2016-01-01 17:00:00
(1 row)
> SELECT date_trunc('month', '2016-01-01 17:27:15'::timestamp);
date_trunc
---------------------
2016-01-01 00:00:00
(1 row)
The date_trunc function comes in handy together with GROUP BY when you want to look at a metric by a custom interval. For example, you can find the average of a metric for a month, or a quarter:
> SELECT date_trunc('month', timestamp),
AVG(value) as metric
FROM event_table
GROUP BY 1;
Tool 2: Use generate_series to plug gaps in the time interval
Sometimes you'll have gaps in your time series because of missing data. For example, if you were to inspect our visitors table, you'd find days where we had no visitors at all. If we were to chart this time series, we'd get a distorted visual:
Timestamp | Visitors |
---|---|
01-Jan-2016 | 10 |
02-Jan-2016 | 7 |
05-Jan-2016 | 10 |
06-Jan-2016 | 15 |
Ideally, we want to insert a zero for these missing days so that our chart isn’t misleading and is properly spaced out:
Timestamp | Visitors |
---|---|
01-Jan-2016 | 10 |
02-Jan-2016 | 7 |
03-Jan-2016 | 0 |
04-Jan-2016 | 0 |
05-Jan-2016 | 10 |
06-Jan-2016 | 15 |
Solution
Postgres provides generate_series to generate a continuous series, given a start, a stop and a step interval:
> SELECT * FROM generate_series(2,10,2);
generate_series
-----------------
2
4
6
8
10
(5 rows)
What's cool is that Postgres natively understands date and time types and can generate a series over them too:
> SELECT * FROM generate_series('2016-01-01'::date, '2016-01-05'::date, '1 day');
generate_series
------------------------
2016-01-01 00:00:00-08
2016-01-02 00:00:00-08
2016-01-03 00:00:00-08
2016-01-04 00:00:00-08
2016-01-05 00:00:00-08
(5 rows)
You can now LEFT JOIN this series with your data table to plug the gaps and make sure the data points are regularly spaced.
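Putting the two together, a gap-filling query might look like this (a sketch; it assumes a visitors table with day and visitor_count columns, which are illustrative names):

```sql
-- One row per calendar day, with zero filled in for missing days
SELECT days.day,
       COALESCE(v.visitor_count, 0) AS visitors
FROM generate_series('2016-01-01'::date,
                     '2016-01-06'::date,
                     '1 day') AS days(day)
LEFT JOIN visitors v
       ON v.day = days.day::date
ORDER BY days.day;
```

The COALESCE turns the NULLs produced by the LEFT JOIN on missing days into zeros.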
Unfortunately, generate_series is not supported on Redshift. We'll cover a workaround in a future article.
Tool 3: Month-over-Month growth rates
To benchmark your business, you’ll want to compute month-over-month or week-over-week growth rates. In the case of Silota, we are not only interested in the number of charts created monthly, but also their growth rates on a month-to-month basis.
> SELECT date_trunc('month', timestamp) as Date,
count(*) as Count
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
ORDER BY 1
Date | Count |
---|---|
2016-01-01 | 10 |
2016-02-01 | 12 |
2016-03-01 | 15 |
The above query gives us a neat table with the number of charts created every month. To compute the growth rates, we use window functions and the lag function. First, let's understand how lag works:
> SELECT date_trunc('month', timestamp) as Date,
count(*) as Count,
lag(count(*), 1) over (order by date_trunc('month', timestamp))
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
ORDER BY 1
Date | Count | lag |
---|---|---|
2016-01-01 | 10 | |
2016-02-01 | 12 | 10 |
2016-03-01 | 15 | 12 |
The lag function returns the value evaluated at the row a given offset before the current row within the partition. Here we simply picked the value from the previous row (an offset of 1). To compute growth rates, it's just a matter of subtracting the previous value from the current one and dividing by the previous value:
> SELECT date_trunc('month', timestamp) as Date,
count(*) as Count,
100 * (count(*) - lag(count(*), 1) over (order by date_trunc('month', timestamp))) / lag(count(*), 1) over (order by date_trunc('month', timestamp)) || '%' as growth
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
ORDER BY 1
Date | Count | growth |
---|---|---|
2016-01-01 | 10 | |
2016-02-01 | 12 | 20% |
2016-03-01 | 15 | 25% |
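Writing the same window definition twice is verbose. In Postgres you can name the window once with a WINDOW clause and reference it from both lag calls (a sketch of the same query; Redshift may not support named windows):

```sql
SELECT date_trunc('month', timestamp) as Date,
       count(*) as Count,
       100 * (count(*) - lag(count(*), 1) OVER w)
           / lag(count(*), 1) OVER w || '%' as growth
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
WINDOW w AS (ORDER BY date_trunc('month', timestamp))
ORDER BY 1;
```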
Tool 4: Running & Cumulative Sums
You might have a count of a particular event aggregated by day, for example with a query:
> SELECT date_trunc('day', timestamp) as Date,
COUNT(metric) as Count
FROM event_table
GROUP BY 1
ORDER BY 1;
Date | Count |
---|---|
2016-01-01 | 10 |
2016-01-02 | 20 |
2016-01-03 | 30 |
2016-02-01 | 10 |
2016-02-02 | 20 |
2016-02-03 | 30 |
Instead, you want to get a cumulative sum of the counts every day. In Postgres, window functions are the most efficient way to perform these kinds of operations.
> SELECT date_trunc('day', timestamp) as Date,
COUNT(metric) as Count,
SUM(COUNT(metric)) OVER (ORDER BY date_trunc('day', timestamp)) as CumSum
FROM event_table
GROUP BY 1
ORDER BY 1;
Date | Count | CumSum |
---|---|---|
2016-01-01 | 10 | 10 |
2016-01-02 | 20 | 30 |
2016-01-03 | 30 | 60 |
2016-02-01 | 10 | 70 |
2016-02-02 | 20 | 90 |
2016-02-03 | 30 | 120 |
In the above query, the OVER clause creates the window, and its ORDER BY tells Postgres to sum up the counts in timestamp order.
If you instead want to reset the running total every month, you'd add a PARTITION BY clause inside the window definition:
> SELECT date_trunc('day', timestamp) as Date,
COUNT(metric) as Count,
SUM(COUNT(metric)) OVER ( PARTITION BY date_trunc('month', timestamp) ORDER BY date_trunc('day', timestamp)) as CumSum
FROM event_table
GROUP BY 1
ORDER BY 1;
Date | Count | CumSum |
---|---|---|
2016-01-01 | 10 | 10 |
2016-01-02 | 20 | 30 |
2016-01-03 | 30 | 60 |
2016-02-01 | 10 | 10 |
2016-02-02 | 20 | 30 |
2016-02-03 | 30 | 60 |
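One subtlety worth knowing: when a window has an ORDER BY but no explicit frame, Postgres defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats rows with the same ordering value as peers and sums them together. Since we grouped by day first, each date is unique here, but spelling the frame out makes the intent explicit (a sketch of the same query):

```sql
SELECT date_trunc('day', timestamp) as Date,
       COUNT(metric) as Count,
       SUM(COUNT(metric)) OVER (
           PARTITION BY date_trunc('month', timestamp)
           ORDER BY date_trunc('day', timestamp)
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) as CumSum
FROM event_table
GROUP BY 1
ORDER BY 1;
```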
Photo Credit: Unsplash