Calculating percentiles, quartiles, deciles, and N-tiles in SQL
A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 60th percentile is the value below which 60% of the observations may be found.
Given a set of observations that has been sorted, the median, first quartile and third quartile can be used split the data into four pieces. The first quartile is the point at which one fourth of the data lies below it. The median is located exactly in the middle of the data set, with half of all of the data below it. The third quartile is the place where three fourths of the data lies below it.
Percentiles have a lot of applications:
- Used to monitor SLA violations (eg. meet 300ms response time)
- Burstable billing (which is the example we'll use)
Burstable billing is a common practice among service providers to not penalize occasional overages. Consider a table with 1 day apart samples of bandwidth consumed:
dt | customer_id | bandwith |
---|---|---|
2016-01-01 | 1 | 4.0 |
2016-01-01 | 2 | 1.2 |
2016-01-02 | 1 | 8.0 |
2016-01-02 | 2 | 3.2 |
2016-01-03 | 1 | 1.4 |
2016-01-03 | 2 | 13.9 |
2016-01-04 | 1 | 5.3 |
2016-01-04 | 2 | 1.8 |
2016-01-05 | 1 | 4.6 |
2016-01-05 | 2 | 3.5 |
... | ... | ... |
You want to charge your customers based on the 95th percentile.
Fortunately, we have access the the NTILE
window function that divides an ordered partition into buckets and assigned a bucket number to each row in the partition.
Step by step:
- Use
ntile(100)
to split the data into 100 roughly even sized buckets. - Use the partition parameter in the window definition to specific a partiion by
customer_id
. - Pick the max value in the 95th bucket for that customer.
select date_trunc('day', dt), customer_id, sum(bandwith) as bandwith
from usage
group by 1, 2;
Will get us the total bandwidth for each customer on each day.
select customer_id,
bandwith,
ntile(100) over (partition by customer_id order by bandwith asc) percentile
from ( ... ) customer_bandwith_percentiles;
Now you can just pick out the max of the bandwith with the percentile bucket number being 95.
select customer_id,
max(bandwith)
from ( ... )
where customer_bandwith_percentiles.percentile = 95
group by 1;
No spam, ever! Unsubscribe any time. See past emails here.