Finding Patterns & Matching Substrings using Regular Expressions
Although there are multiple ways to do pattern matching in SQL, we will look at the most powerful and flexible of them all – regular expressions.
Fortunately, PostgreSQL and Redshift support POSIX regular expressions out of the box. We'll consider two examples that use regular expressions to pattern match.
1. Pattern Matching Business and Consumer Email Addresses
A common analysis task is to count the number of emails in your database that are of commercial value, i.e., not using disposable email addresses or free providers like gmail or hotmail.
We'll make an assumption that the @
character splits the email address to a name and domain name. If your database table named "emails" contains email addresses like these:
first@gmail.com |
second@gmail.com |
third@business.com |
fourth@provider.com |
Then the split_part
function will split the email address to its respective parts:
select email,
split_part(email, '@', 1) as name,
split_part(email, '@', 2) as domain
from email;
which gives us:
name | domain | |
---|---|---|
first@gmail.com | first | gmail.com |
second@gmail.com | second | gmail.com |
third@business.com | third | business.com |
fourth@provider.com | fourth | provider.com |
We'll next need a database of services that hand out free and disposable email addresses. We found a github project that has an actively maintained list of such services. If the data for these services is in a table named "free_email_domains", a simple join with this table will classify an email as a business user or a consumer.
select emails.email as email,
case when
free_email_domains.domain is not null
then 'free'
else 'business'
end as email_type
from emails
join free_email_domains
on free_email_domains.domain = split_part(emails.email, '@', 2);
email_type | |
---|---|
first@gmail.com | free |
second@gmail.com | free |
third@business.com | business |
fourth@provider.com | business |
2. Pattern Matching Google Analytics UTM parameters
If you've ever worked with marketers, you would have come across UTM parameters. These are small bits of information attached to every URL to track campaign and channel effectiveness amongst others. A tracked URL will look like something like this:
http://www.example.com/?utm_source=facebook&utm_medium=social&utm_campaign=black-friday
The parameters are attached as query parameters after the "?" and have standard definitions like source, medium, campaign, etc. You can use a handy tool like the Google Campaign URL Builder to build URLs like these.
Suppose we have a table named pageviews
with the following schema:
dt | url |
---|---|
2016-01-01 | http://www.example.com/?utm_source=facebook&utm_medium=social&utm_campaign=black-friday |
2016-01-02 | http://www.example.com/?utm_source=google&utm_medium=cpc&utm_campaign=black-friday |
and we want a result set that looks like this:
dt | source | medium | campaign |
---|---|---|---|
2016-01-01 | social | black-friday | |
2016-01-02 | cpc | black-friday |
We are going to make use the substring
and regexp_replace
functions to first extract a substring that matches the regular expression for campaign, source and medium and then replace the matched text to get what we want. We can do this in one step, but for illustration purposes, we'll do it in two steps. Our regular expression is going to take the form:
(?!&)utm_campaign=[^&]*(?=&)
To first extract utm_campaign
, we use the query:
SELECT dt, substring(url from '(?!&)utm_campaign=[^&]*(?=&)')
FROM pageviews;
which gives us:
dt | substring |
---|---|
2016-01-01 | utm_campaign=facebook |
2016-01-02 | utm_campaign=google |
then we replace the utm_campaign=
string with an empty string for the final result:
SELECT dt, regexp_replace(substring(url from '(?!&)utm_campaign=[^&]*(?=&)'),
'utm_[^=]*=', '') as utm_campaign
FROM pageviews;
which gives us:
dt | utm_campaign |
---|---|
2016-01-01 | |
2016-01-02 |
We can now repeat this pattern for utm_medium
and utm_source
to get the final result.
No spam, ever! Unsubscribe any time. See past emails here.