Time series are everywhere: your web event logs, temperature readings, startup venture capital financings and your customer acquisition data. Squint hard enough at any dataset and you'll find a time axis. Here are four tools for working with time series data in PostgreSQL and Redshift…
An example of a time series
A time series dataset has a simple structure; for example, your rows might look something like this:
Timestamp | User | Event | Event Properties |
---|---|---|---|
2016-01-01 | user_3dfc123 | Signup | {} |
2016-01-02 | user_3dfc123 | Confirm Email | {} |
2016-01-03 | … | … | … |
This is an example of a normalized table: you'll have to join it with the users table to look up user properties like name, location, etc.
The table is structured around a user, the event name and the properties related to that event. And of course, being a time series, there's a timestamp of when the event was generated.
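A schema for such an events table might look like this (a hypothetical sketch; the table and column names are illustrative, not from the original article):

```sql
CREATE TABLE events (
    timestamp        timestamptz NOT NULL,  -- when the event was generated
    user_id          text        NOT NULL,  -- joins to the users table
    event_name       text        NOT NULL,  -- e.g. 'Signup', 'Confirm Email'
    event_properties jsonb       NOT NULL DEFAULT '{}'
);
```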
Tool 1: Use date_trunc to group and aggregate by custom periods
The date_trunc function truncates a timestamp to a specified precision. Postgres supports a wide variety of precisions: second, minute, hour, day, week, month, quarter, year, etc.
Examples:
> SELECT date_trunc('hour', '2016-01-01 17:27:15'::timestamp);
date_trunc
---------------------
2016-01-01 17:00:00
(1 row)
> SELECT date_trunc('month', '2016-01-01 17:27:15'::timestamp);
date_trunc
---------------------
2016-01-01 00:00:00
(1 row)
The date_trunc function comes in handy together with GROUP BY when you want to look at a metric by a custom interval. For example, you can find the average of a metric for a month, or a quarter:
> SELECT date_trunc('month', timestamp),
AVG(value) as metric
FROM event_table
GROUP BY 1;
Tool 2: Use generate_series to plug gaps in the time interval
Sometimes you'll have gaps in your time series because of missing data. For example, if you were to inspect our visitors table, you'd find days where we had no visitors at all. If we were to chart this time series, we'd get a distorted visual:
Timestamp | Visitors |
---|---|
01-Jan-2016 | 10 |
02-Jan-2016 | 7 |
05-Jan-2016 | 10 |
06-Jan-2016 | 15 |
Ideally, we want to insert a zero for these missing days so that our chart isn’t misleading and is properly spaced out:
Timestamp | Visitors |
---|---|
01-Jan-2016 | 10 |
02-Jan-2016 | 7 |
03-Jan-2016 | 0 |
04-Jan-2016 | 0 |
05-Jan-2016 | 10 |
06-Jan-2016 | 15 |
Solution
Postgres provides generate_series to generate a continuous series, given a start, a stop and a step interval:
> SELECT * FROM generate_series(2,10,2);
generate_series
-----------------
2
4
6
8
10
(5 rows)
What's cool is that Postgres natively understands date and time types and can generate a series over them too:
> SELECT * FROM generate_series('2016-01-01'::date, '2016-01-05'::date, '1 day');
generate_series
------------------------
2016-01-01 00:00:00-08
2016-01-02 00:00:00-08
2016-01-03 00:00:00-08
2016-01-04 00:00:00-08
2016-01-05 00:00:00-08
(5 rows)
You can now LEFT JOIN this series with your data table to plug the gaps and make sure the data points are regularly spaced.
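Putting the two together, a gap-filling query might look like this (a sketch; it assumes a visitors table with day and visitor_count columns, which are illustrative names):

```sql
-- One row per calendar day, with zero filled in for missing days
SELECT days.day,
       COALESCE(v.visitor_count, 0) AS visitors
FROM generate_series('2016-01-01'::date,
                     '2016-01-06'::date,
                     '1 day') AS days(day)
LEFT JOIN visitors v
       ON v.day = days.day::date
ORDER BY days.day;
```

The COALESCE turns the NULLs produced by the LEFT JOIN on missing days into zeros.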
Unfortunately, generate_series is not supported on Redshift. We'll cover a workaround in a future article.
Tool 3: Month-over-Month growth rates
To benchmark your business, you’ll want to compute month-over-month or week-over-week growth rates. In the case of Silota, we are not only interested in the number of charts created monthly, but also their growth rates on a month-to-month basis.
> SELECT date_trunc('month', timestamp) as Date,
count(*) as Count
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
ORDER BY 1
Date | Count |
---|---|
2016-01-01 | 10 |
2016-02-01 | 12 |
2016-03-01 | 15 |
The above query gives us a neat table with the number of charts created every month. To compute the growth rates, we use window functions and the lag function. First, let's understand how lag works:
> SELECT date_trunc('month', timestamp) as Date,
count(*) as Count,
lag(count(*), 1) over (order by date_trunc('month', timestamp))
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
ORDER BY 1
Date | Count | lag |
---|---|---|
2016-01-01 | 10 | |
2016-02-01 | 12 | 10 |
2016-03-01 | 15 | 12 |
The lag function returns the value evaluated at the row a given offset before the current row within the partition. Here we simply picked the value from the previous row (an offset of 1). To compute growth rates, it's just a matter of subtracting the previous value from the current one and dividing by the previous value:
> SELECT date_trunc('month', timestamp) as Date,
count(*) as Count,
100 * (count(*) - lag(count(*), 1) over (order by date_trunc('month', timestamp))) / lag(count(*), 1) over (order by date_trunc('month', timestamp)) || '%' as growth
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
ORDER BY 1
Date | Count | growth |
---|---|---|
2016-01-01 | 10 | |
2016-02-01 | 12 | 20% |
2016-03-01 | 15 | 25% |
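Writing the same window definition twice is verbose. In Postgres you can name the window once with a WINDOW clause and reference it from both lag calls (a sketch of the same query; Redshift may not support named windows):

```sql
SELECT date_trunc('month', timestamp) as Date,
       count(*) as Count,
       100 * (count(*) - lag(count(*), 1) OVER w)
           / lag(count(*), 1) OVER w || '%' as growth
FROM events
WHERE event_name = 'Created Chart'
GROUP BY 1
WINDOW w AS (ORDER BY date_trunc('month', timestamp))
ORDER BY 1;
```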
Tool 4: Running & Cumulative Sums
You might have a count of a particular event aggregated by day, for example with a query:
> SELECT date_trunc('day', timestamp) as Date,
COUNT(metric) as Count
FROM event_table
GROUP BY 1
ORDER BY 1;
Date | Count |
---|---|
2016-01-01 | 10 |
2016-01-02 | 20 |
2016-01-03 | 30 |
2016-02-01 | 10 |
2016-02-02 | 20 |
2016-02-03 | 30 |
Instead, you want to get a cumulative sum of the counts every day. In Postgres, window functions are the most efficient way to perform these kinds of operations.
> SELECT date_trunc('day', timestamp) as Date,
COUNT(metric) as Count,
SUM(COUNT(metric)) OVER (ORDER BY date_trunc('day', timestamp)) as CumSum
FROM event_table
GROUP BY 1
ORDER BY 1;
Date | Count | CumSum |
---|---|---|
2016-01-01 | 10 | 10 |
2016-01-02 | 20 | 30 |
2016-01-03 | 30 | 60 |
2016-02-01 | 10 | 70 |
2016-02-02 | 20 | 90 |
2016-02-03 | 30 | 120 |
In the above query, the OVER clause creates the window, and its ORDER BY tells Postgres to sum up the counts in timestamp order.
If you instead want to reset the running total every month, you'd add a PARTITION BY clause inside the window definition:
> SELECT date_trunc('day', timestamp) as Date,
COUNT(metric) as Count,
SUM(COUNT(metric)) OVER ( PARTITION BY date_trunc('month', timestamp) ORDER BY date_trunc('day', timestamp)) as CumSum
FROM event_table
GROUP BY 1
ORDER BY 1;
Date | Count | CumSum |
---|---|---|
2016-01-01 | 10 | 10 |
2016-01-02 | 20 | 30 |
2016-01-03 | 30 | 60 |
2016-02-01 | 10 | 10 |
2016-02-02 | 20 | 30 |
2016-02-03 | 30 | 60 |
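One subtlety worth knowing: when a window has an ORDER BY but no explicit frame, Postgres defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats rows with the same ordering value as peers and sums them together. Since we grouped by day first, each date is unique here, but spelling the frame out makes the intent explicit (a sketch of the same query):

```sql
SELECT date_trunc('day', timestamp) as Date,
       COUNT(metric) as Count,
       SUM(COUNT(metric)) OVER (
           PARTITION BY date_trunc('month', timestamp)
           ORDER BY date_trunc('day', timestamp)
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) as CumSum
FROM event_table
GROUP BY 1
ORDER BY 1;
```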
Photo Credit: Unsplash