Calculating Z-Score in SQL
In statistics, the z-score (or standard score) of an observation is the number of standard deviations that it is above or below the population mean.
where,
is the observation, is the mean and is the standard deviation.
You can use the z-score to answer questions like the following:
- What percentage of values fall below a specific value?
- What values can be expected to be outliers?
- What is the relative score of one distribution versus another? Literally comparing apples and oranges.
It is generally accepted that z-scores lower than -1.96 or higher than 1.96 to be outliers, or at the least worth a second look. These values are statistically significant at the 95% confidence level. For a higher confidence level of 99%, you are interested in z-scores of higher than 2.576 or lower than -2.576. We are assuming that your data fits a normal distribution.
Calculating the Z-score of Several Columns
Consider an ecommerce store that sells products online. The two metrics our analytics team is tracking everyday is sales and website visitors. The table will look something like this:
date | sales | visitors | conversion |
---|---|---|---|
2016-01-01 | 21 | 3373 | 0.62% |
2016-01-02 | 50 | 3820 | 1.31% |
2016-01-03 | 50 | 3175 | 1.57% |
2016-01-04 | 33 | 4013 | 0.82% |
2016-01-05 | 58 | 4022 | 1.44% |
2016-01-06 | 5 | 4873 | 0.25% |
2016-01-07 | 36 | 1924 | 1.87% |
2016-01-08 | 44 | 3867 | 1.14% |
2016-01-09 | 28 | 3621 | 0.77% |
2016-01-10 | 50 | 1722 | 2.90% |
We have some days with particularly good sales (eg: 2016-01-10 with 50 sales.) But it's difficult to get a feel for the data because the daily sales and visitors are not directly comparable. That's the "apples -vs- oranges" argument we put forth earlier.
Let's compute the z-scores for these two columns using SQL. We make use of Computed Table Expressions (CTEs) introduced in Postgres 9.4 in order to precalculate the means and standard deviations for the two columns:
with sales_stats as
(select avg(sales) as mean,
stddev(sales) as sd
from zscore),
visitor_stats as
(select avg(visitors) as mean,
stddev(visitors) as sd
from zscore)
select dt,
abs(sales - sales_stats.mean) / sales_stats.sd as z_score_sales,
abs(visitors - visitor_stats.mean) / visitor_stats.sd as z_score_visitors
from sales_stats,
visitor_stats,
zscore;
date | z_score_sales | z_score_visitors |
---|---|---|
2016-01-01 | 1.02 | 0.07 |
2016-01-02 | 0.77 | 0.39 |
2016-01-03 | 0.77 | 0.28 |
2016-01-04 | 0.28 | 0.59 |
2016-01-05 | 1.26 | 0.60 |
2016-01-06 | 2.00 | 1.48 |
2016-01-07 | 0.09 | 1.57 |
2016-01-08 | 0.40 | 0.44 |
2016-01-09 | 0.59 | 0.19 |
2016-01-10 | 0.77 | 1.78 |
Whoa, sales on 2016-01-06 has definitely take a hit and is worth investigating further!
No spam, ever! Unsubscribe any time. See past emails here.