Finding Duplicate Rows in SQL
A common mechanism for defending against duplicate rows in a database table is to put a unique index on the column. However, at times, your data might come from external dirty data sources and your table will have duplicate rows. You'll have to remove duplicate rows in the table before a unique index can be added.
A great way to find duplicate rows is by using window functions – supported by most major databases.
Consider a follow table dedup
with duplicates:
name | |
---|---|
mike | mike@example.com |
mike k | mike@example.com |
sean | sean@example.com |
sean | sean@example.com |
taylor | taylor@example.com |
Finding duplicate values in one column
The following query picks the email
column to deduplicate,
select email,
count(*)
from dedup
group by email
having count(*) > 1;
returns,
count | |
---|---|
mike@example.com | 2 |
sean@example.com | 2 |
which are the duplicate emails in the table with their counts. The next step is to number the duplicate rows with the row_number
window function:
select row_number() over (partition by email),
name,
email
from dedup;
row_number | name | |
---|---|---|
1 | mike | mike@example.com |
2 | mike k | mike@example.com |
1 | sean | sean@example.com |
2 | sean | sean@example.com |
1 | taylor | taylor@example.com |
We can then wrap the above query filtering out the rows with row_number
column having a value greater than 1.
select * from (
select row_number() over (partition by email),
name,
email
from dedup ) t
where t.row_number < 2;
row_number | name | |
---|---|---|
1 | mike | mike@example.com |
1 | sean | sean@example.com |
1 | taylor | taylor@example.com |
Notes about the ROW_NUMBER window function
The row_number
is a standard window function and supports the regular parameters for a window function. We'd like to point out two cases that are of interest:
- In a case where you want to pick a deduplicate row according a different criteria, you can make use of the
ORDER
clause inside the window function to order the partition. - In a case where you want to deduplicate on multiple columns, you can specific those columns are parameters to the
partition
clause.
No spam, ever! Unsubscribe any time. See past emails here.