Making Correlation Coefficient Matrices to understand relationships in SQL
A scatter X-Y plot is a straightforward way to visualize the dependency between two variables. However, at times you want to understand how more than two variables are related. The correlation coefficient can be calculated between pairs of variables that scales between +1 and -1 demonstrating the degree of positive or negative correlation. The relationship between three variables can be presented as a symmetric matrix of dimensions with a value of 1 along the diagonal.
In this recipe, we'll look for correlations betwen two or more variables and visualize it as a matrix. You can use the correlation matrix to figure out what activities are correlated, to plan future activities. For example, within your customer support department, how are net promoter scores (NPS) related to support wait times.
The Correlation Coefficient math
One of the most commonly used correlation formula is Pearson’s. For a sample,
where, and are the sample means of variables and .
The SQL to correlate email subject lengths and open rates
Using Mailchimp's email campaign data – we are interested in finding if there's a relationship between the length of the subject and open/click rates.
1. Calculating correlation by hand
We are going to be using Common Table Expressions (CTEs) to organize our intermediate results:
with table_mean as
(
select avg(char_length(subject_line)) as mean_subject_length,
avg(report_summary_open_rate) as mean_open_rate
from mailchimp.campaigns
),
table_corrected as
(
select char_length(subject_line) - mean_subject_length as mean_subject_length_corrected,
report_summary_open_rate - mean_open_rate as mean_open_rate_corrected
from table_mean, mailchimp.campaigns
),
select sum(mean_subject_length_corrected * mean_open_rate_corrected) / sqrt(sum(mean_subject_length_corrected * mean_subject_length_corrected) * sum(mean_open_rate_corrected * mean_open_rate_corrected)) as r
from table_corrected;
This is a direct translation of the math equation. Fortunately, we don't have to repeat this each time, we can simply use the in-built corr
function to calculate the correlation for us.
2. Calculating pairwise correlation using corr
The correlation can be calculated as follows:
select corr(char_length(subject_line), report_summary_open_rate) as r;
For more than two variables, we are going to repeat the correlation calculation pairwise between the variables and organize the results in the follow format. In step 3, it'll be clear why we use this format:
row | col | coeff |
---|---|---|
subject_length | subject_length | (always 1) |
subject_length | open_rate | coefficient value |
subject_length | click_rate | coefficient value |
open_rate | open_rate | (always 1) |
open_rate | click_rate | coefficient value |
click_rate | click_rate | (always 1) |
where subject_length
is char_length(subject_line)
.
select 'subject_length' as row,
'subject_length' as col,
corr(subject_length, subject_length) as coeff
from mailchimp.campaigns
union
select 'subject_length' as row,
'open_rate' as col,
corr(subject_length, open_rate) as coeff
from mailchimp.campaigns
union
select 'subject_length' as row,
'click_rate' as col,
corr(subject_length, click_rate) as coeff
from mailchimp.campaigns
union
select 'open_rate' as row,
'open_rate' as col,
corr(open_rate, open_rate) as coeff
from mailchimp.campaigns
union
select 'open_rate' as row,
'click_rate' as col,
corr(open_rate, click_rate) as coeff
from mailchimp.campaigns
union
select 'click_rate' as row,
'click_rate' as col,
corr(click_rate, click_rate) as coeff
from mailchimp.campaigns
3. Pivoting the table to get a matrix
In this step, we'll be building a manual pivot table using the values from the col
column. After the pivot, the values of the col
column will become new columns in the resulting table.
select row,
sum(case when col='subject_length' then coeff else 0 end) as subject_length,
sum(case when col='open_rate' then coeff else 0 end) as open_rate,
sum(case when col='click_rate' then coeff else 0 end) as click_rate
from (
// ... query as before
)
group by row
order by row DESC
and our table will look like this:
row | subject_length | open_rate | click_rate |
---|---|---|---|
subject_length | 1 | -0.31595114872448 | -0.102740332427165 |
open_rate | 0 | 1 | 0.798087505055931 |
click_rate | 0 | 0 | 1 |
which matches the values we get from Excel or Google sheets. Our calculations show that open and click rates are negatively correlated with subject lengths and there's a strong positive relation between open and click rates (which is obvious because they are not independent variables – you need to open an email in order to click it.)
No spam, ever! Unsubscribe any time. See past emails here.