Multichannel Marketing Attribution Modeling in SQL
A big challenge a lot of marketing departments face is measuring and improving the effectiveness of their marketing campaigns. It's important to figure out ROI because you can then double-down on campaigns/channels that have positive returns and discard marketing programs that are ineffective.
With one or two channels, it's straightforward to attribute a conversion event (i.e., purchase or signup) to that channel. With three or more channels, it gets complicated fast, and we need an attribution model to fully credit the conversion event.
Outcomes of the attribution model:
- budgeting – use historical attribution data to inform future spend
- forecasting – expectations on how future spend will perform
- branding – measure the impact of branding campaigns
- channel ROI – measure the performance of a channel, key to computing cost of customer acquisition and lifetime value of a customer
- affiliate payments – measure the impact of a partner channel on your conversion events
1. What is marketing attribution
Consider three users: Alice, Bob and Eve who've completed a conversion event – make a purchase. The timeline of events is as follows:
User | Session 3 | Session 2 | Session 1 | Purchase |
---|---|---|---|---|
Alice | – | – | Adwords | $2,000 |
Bob | – | Direct | $5,000 | |
Eve | SEO | SEM | $1,000 |
Let's enumerate what's happening here:
- three users span five different channels: Adwords, Email, SEO, Direct (as in they type in the URL of your website) and Facebook display adverts
- the session/channel mix can be multiple days apart
- the conversion values ($2,000, $5,000 and $1,000) are different
The conversion event for Alice can be attributed to Adwords directly, but what about Bob and Eve who've had touch points over multiple channels and over multiple sessions. What's the return on each channel? How do you compute the cost of customer acquisition? Attribution modeling is about answering these types of questions.
2. Common attribution models (as seen in Google Analytics)
We'll use Google Analytics, the 800lb gorilla in the analytics space, to guide our models. Google Analytics lists seven kinds of attribution models:
Model | Description | Benefit |
---|---|---|
Last click | 100% of the credit goes to the last click. | Simple, but only useful for brand awareness campaigns. |
First click | 100% of the credit goes to the first click. | Simple, but ignores the channel that ultimately drove the conversion. |
Linear | Evenly distribute credit to all clicks. | Simple, but does not differentiate between high value events and low value events. |
Time decay | More credit to the most recent click, credit reduces with time. | Recognizes the significance of every channel along the conversion path and gives the most credit to the channel that drove the conversion. |
Position based | More credit to the first and last, everything else is evenly distributed. | Combines the best features of linear and time-decay models. |
Last non-direct clicks | 100% of the credit to the last non-direct channel and ignores all other channels. | Remove the limitation of "direct" traffic. |
Last AdWords click | 100% of the credit to the last AdWords click. | Simple to understand and easy to compute the cost of acquisition. |
3. Why "last click" is the default attribution model
The "last click" attribution model is an industry standard and the default for many marketing departments. This model assigns 100% of the credit for a conversion event to the last marketing channel the customer used. For example, if you clicked on a sponsored search/banner advert for a product and then buy, that channel will get 100% of the credit. It ignores all previous clicks/interactions before that conversion.
The assumption is that had the advert not been shown, the probability of purchase would have been 0%. This is a big assumption to make. In a 2012 survey by Econsultancy and Google, only 14% of marketers believed that "last click" was effective. Yet, the same survey revealed that more than half of them continued to rely on it for their day-to-day reporting.
Hmm, what gives?
4. What's the best attribution model
There is no right answer to this question. Pick what makes sense for your business. What we do want to highlight is the profitability of a channel depends on the attribution model you pick. Let's take an example:
A | B | C | D | E | F | G | H | I | J | |
---|---|---|---|---|---|---|---|---|---|---|
1 | Last click | First click | Linear | Time decay | ||||||
2 | Channel | CPC | Credit | Cost | Credit | Cost | Credit | Cost | Credit | Cost |
3 | $0.75 | 0% | (not profitable) | 100% | $0.75 | 25.00% | $3.00 | 10% | $7.50 | |
4 | Banner | $1.00 | 0% | (not profitable) | 0% | (not profitable) | 25.00% | $4.00 | 15% | $6.67 |
5 | $1.50 | 0% | (not profitable) | 0% | (not profitable) | 25.00% | $6.00 | 25% | $6.00 | |
6 | SEM | $2.00 | 100% | $2.00 | 0% | (not profitable) | 25.00% | $8.00 | 50% | $4.00 |
Assuming a user goes sequentially through the channels Facebook, Banner, Email and SEM with the respective cost per click (CPC) of $0.75, $1.00, $1.50 and $2.00. If we followed the last click model or the time-decay model, SEM is the cheapest source of acquiring customers. In the first-click or linear models, Facebook is the cheapest source.
The cost of customer acquisition along with payback period is one of the fundamentals of valuing a business. Getting this wrong can put your business on the wrong path to profitability!
5. SQL for marketing attribution modeling
5a. Organizing the marketing touch points
Our first step is to organize the marketing touch points so that we can iterate on our attribution models. Given a table pageviews
of web traffic logs with the schema:
timestamp | user_id | url |
---|---|---|
2017-01-01 | 1 | http.../sql-z-score.html?utm_source=Silota+SQL+Updates&utm_medium=email |
2017-01-02 | 2 | http.../sql-z-score.html?utm_source=Adwords&utm_medium=cpc |
2017-01-03 | 3 | http.../sql-z-score.html?utm_source=facebook&utm_medium=social&utm_campaign=black-friday |
... | ... | ... |
where the url
column contains the marketing channel source, medium. campaign and any other relevant information. We can parse this url to a friendly column with a simple case
statement:
select timestamp,
user_id,
case
when url ilike '%utm_medium=email%' then 'email'
when url ilike '%utm_medium=cpc%' then 'paid'
end as channel
from pageviews;
where, ilike
is the case-insensitive pattern matching function.
The above is only meant for illustrative purposes – for a complete regular expression that can parse out UTM parameters, see our SQL Regexp recipe. We now have a table that looks something like this:
timestamp | user_id | channel |
---|---|---|
2017-01-01 | 1 | |
2017-01-02 | 2 | paid |
2017-01-03 | 3 | banner |
2017-01-04 | 1 | paid |
... | ... | ... |
We can perform one more level of aggregation usign array_agg
to organize the channels into a PostgreSQL array for each user. The array_agg
is a way to aggregate a set of string values in a comma separated format.
select user_id,
array_agg(
case
when url ilike '%utm_medium=email%' then 'email'
when url ilike '%utm_medium=cpc%' then 'paid'
else 'unknown'
end order by timestamp
)::text as channels
from pageviews;
user_id | channels |
---|---|
1 | {email, paid} |
2 | {paid, banner, organic} |
3 | {banner, organic, paid} |
... | ... |
Let's call this table marketing_channels
.
5b. Organizing the revenue and conversion events
The revenue and conversion data can be dumped from your lead database or payment processor. The structure of this table transactions
will be something like this:
timestamp | user_id | amount |
---|---|---|
2017-01-01 | 1 | 200.0 |
2017-01-10 | 3 | 150.0 |
2017-01-15 | 4 | 100.0 |
... | ... | ... |
We can aggregate the revenue amounts for each user to get total amounts.
-- saved as revenue_events
select user_id, sum(amount) as total
from transactions
group by user_id;
It's now just a matter of joining the revenue data revenue_events
and marketing touches marketing_channels
with the right attribution model to connect revenue to channel.
5c. First-click marketing attribution model
The first-click attribution model gives 100% to the credit to the first channel the user is seen. We can use PostgreSQL's array subscripts to pick out the first channel.
select mc.user_id, mc.channels[1], re.total
from marketing_channels mc
inner join revenue_events re
on mc.user_id = re.user_id;
user_id | channels | total |
---|---|---|
1 | 200.0 | |
3 | banner | 150.0 |
... | ... | ... |
A group by channel and aggregating the total
column will give us the total revenue attributed to each channel:
select channel, sum(total) from (
select mc.user_id, mc.channels[1], re.total
from marketing_channels mc
inner join revenue_events re
on mc.user_id = re.user_id
) a
group by channel;
5d. Last-click marketing attribution model
Similar to the first-click model, we give 100% of the credit to the last channel the user is seen. We'll use PostgreSQL's array_length
to find the length of the channels array and use that as a subscript.
select mc.user_id, mc.channels[array_length(mc.channels, 1)], re.total
from marketing_channels mc
inner join revenue_events re
on mc.user_id = re.user_id;
user_id | channels | total |
---|---|---|
1 | paid | 200.0 |
3 | paid | 150.0 |
... | ... | ... |
5e. Linear marketing attribution model
This model evenly distributes the credit to all marketing touch points. The approach we'll take here is to calculate the weight of each channel by the number of touch points and then aggregate by the weighted revenue. Taking this step by step:
select user_id, channels, 1.0 / array_length(channels ,1) as weights
from (
// ... query for marketing_channels as before
)
gives us:
user_id | channels | weights |
---|---|---|
1 | {email, paid} | 0.5 |
2 | {paid, banner, organic} | 0.3333 |
... | ... | ... |
We'll now expand the list of channels into individual rows using PostgreSQL's array unnest
function:
select user_id, channels, weights, unnest(channels)
from (
select user_id, channels, 1.0 / array_length(channels, 1) as weights
from (
// ... query for marketing_channels as before
)
) b
which gives us:
user_id | channels | weights | unnest |
---|---|---|---|
1 | {email, paid} | 0.5 | |
1 | {email, paid} | 0.5 | paid |
2 | {paid, banner, organic} | 0.3333 | paid |
2 | {paid, banner, organic} | 0.3333 | banner |
2 | {paid, banner, organic} | 0.3333 | organic |
... | ... | ... | ... |
A join with the revenue_events
on the user_id
, multiplying the revenue by the weights and an aggregate by channel gives us the weighted revenue total by channel.
select unnest_channel, sum(c.weights * re.total)
from (
select user_id, channels, weights, unnest(channels) as unnest_channel
from (
select user_id, channels, 1.0 / array_length(channels ,1) as weights
from (
// ... query for marketing_channels as before
)
) b
) c
inner join revenue_events re
on c.user_id = re.user_id
group by unnest_channel;
5f. Time-decay marketing attribution model
As with the linear model, we can vary the weights by a decay function. We'll use as the decay function, so the weights will be , , and so on. We also have to make sure the weights add upto 1.
In general, is a geometric series and the sum has a closed form equation:
We'll do unnest
to get individual rows, row_number
to enumerate the rows partitioned by user_id
, pow(2.0, row_number())
to generate the weights, and a sum
with the closed form equation to scale the weights back to 1.
No spam, ever! Unsubscribe any time. See past emails here.