Creating Pareto Charts to visualize the 80/20 principle in SQL
A Pareto chart highlights the most important factors amongst a large number of factors. A social scientist in Italy, Vilfredo Pareto discovered that roughly 20% of the population owned 80% of the wealth. From there, he noticed that these proportions described many other aspects of society. This led him to propose the 80-20 rule. The Pareto chart, named in his honor, shows the large contribution from a small proportion of the population.
You'll notice the 80-20 principle in action across your business:
- a small proportion of customers generate the most customer support load
- a small proportion of customers contribute the most to the over all revenue – financial statements usually list this as a risk factor
- a small proportion of web pages generate the most pageviews
- a small proportion of adwords keywords generate the most clicks and conversions, etc.
The Pareto chart has two components:
- a column chart where the height of the bar is the frequency of that particular category
- a line chart that shows the cumulative percentage contribution to the total
For this example, we'll be aggregating the top pages on our site by pageviews and a cumulative percentage contribution to the total number of page views in that period.
1. Frequency distribution using GROUP BY
To calculate the total number of pageviews, categorized by page title and sorted descending by pageviews:
select title,
count(*) as pageviews
from visits
where visits.dt > (now() - '2 weeks'::interval)
-- just calculate for the last two weeks
group by title
order by 2 desc
The above query will give us a table like the following:
title | pageviews |
---|---|
Calculating Percentage (%) of Total Sum | 354 |
Calculating Month-over-Month Growth Rate | 156 |
Calculating Relationships with Correlation Matrices | 156 |
Forecasting, Identifying Trends with Linear Regression | 134 |
... | ... |
2. Cumulative sum using Window functions
We'll next have to calculate a cumulative, running total (read more about SQL Running Totals.) For this, we'll be making use of window functions:
select title,
pageviews,
sum(pageview) over (order by pageviews desc) as cumsum
from (
// query as before
) tbl;
The result will now look like this:
title | pageviews | cumsum |
---|---|---|
Calculating Percentage (%) of Total Sum | 354 | 354 |
Calculating Month-over-Month Growth Rate | 156 | 510 |
Calculating Relationships with Correlation Matrices | 156 | 666 |
Forecasting, Identifying Trends with Linear Regression | 134 | 800 |
... | ... | ... |
The last remaining step is to divide the cumulative sum by the total number of pageviews to calculate a percentage. We can precalculate the total number of pageviews with a common table expression:
with total_pageviews as
(
select count(pageviews) as cnt
from visits
where visits.dt > (now() - '2 weeks'::interval)
)
select title,
pageviews,
sum(pageview) over (order by pageviews desc) * 1.0 / total_pageviews.cnt as cumsum
from (
select title,
count(*) as pageviews
from visits
where visits.dt > (now() - '2 weeks'::interval)
-- just calculate for the last two weeks
group by title
order by 2 desc
) tbl;
No spam, ever! Unsubscribe any time. See past emails here.