How to delete duplicate records in snowflake

Duplicate records can be a common problem when working with large datasets in Snowflake. These duplicates can cause confusion and inaccuracies in your analysis and reporting.

In this article, we will discuss a methods for identifying and deleting duplicate records in Snowflake. By the end of this article, you will have a better understanding of how to effectively manage and remove duplicate records in Snowflake.

On a typicall database this can be done by executing the following query.

WITH C AS(
SELECT *
	  ,ROW_NUMBER() OVER(PARTITION BY ROWS ORDER BY ROWS) AS RN
FROM TABLE
)
DELETE
FROM C
WHERE RN > 1;

But by executing the following query on Snowflake you get an error message

This happens as Snowflake does not support the typical update/delete/insert statements when using a CTE.

For overcomming this issue you can execute the following steps.


Step 0. Creating Demo table

First we are going to create a demo table and insert the elements into that table.

-- Create table 
CREATE OR REPLACE TABLE SANDBOX.TEST.USERS(
    NAME VARCHAR(50) NULL,
    LAST_NAME VARCHAR(50) NULL,
    USER_NAME VARCHAR(50) NULL,
    CREATED_AT TIMESTAMP NULL 
);
-- Inserting data into table, this values would serve for control
INSERT INTO SANDBOX.TEST.USERS(NAME,LAST_NAME,USER_NAME,CREATED_AT)
values('Serena','Williams','swilliams',CURRENT_TIMESTAMP)
     ,('Stu','Johnson','sjohnson',CURRENT_TIMESTAMP);
-- Running the table two times
INSERT INTO SANDBOX.TEST.USERS(NAME,LAST_NAME,USER_NAME,CREATED_AT)
values('Alvaro','Parra','aparra',CURRENT_TIMESTAMP)
     ,('John','Doe','jdoe',CURRENT_TIMESTAMP)
     ,('Jack','Doe','jadoe',CURRENT_TIMESTAMP);
INSERT INTO SANDBOX.TEST.USERS(NAME,LAST_NAME,USER_NAME,CREATED_AT)
values('Alvaro','Parra','aparra',CURRENT_TIMESTAMP)
     ,('John','Doe','jdoe',CURRENT_TIMESTAMP)
     ,('Jack','Doe','jadoe',CURRENT_TIMESTAMP);

Then, we are going to check the duplicate rows on the table by executing the following query.

-- CHECKING THE DUPLICATES ON THE TABLE
SELECT *
      ,ROW_NUMBER() OVER(PARTITION BY NAME, LAST_NAME, USER_NAME ORDER BY CREATED_AT DESC) AS RN
FROM SANDBOX.TEST.USERS;

As you noticed in that query we have some cases were we have duplicate values based on NAME, LAST_NAME and USER_NAME.

Now we can proceed to delete those duplicate rows in the table.

Step 1. Store duplicate records into a table

The following query is creating a temp table storing all the duplicate records that we have on the table (based on the columns that we are going to use for identifying duplicate rows).

CREATE OR REPLACE TEMP TABLE SANDBOX.TEST.DELDUP_USERS AS
WITH DUP AS(
SELECT *
      ,ROW_NUMBER() OVER(PARTITION BY NAME, LAST_NAME, USER_NAME ORDER BY CREATED_AT DESC) AS RN
FROM SANDBOX.TEST.USERS
)
SELECT *
FROM DUP
--USING RN TO FILTER DUPLICATE ROWS AND ONLY KEEP ONE
WHERE RN=2;

Step 2. Delete records from the table

Then we are going to delete the duplicated rows from the table. this query deletes the records that we stored on the temp table, including duplicates and non duplicates .

DELETE 
FROM SANDBOX.TEST.USERS AS TAB
USING SANDBOX.TEST.DELDUP_USERS AS DEL
WHERE TAB.NAME = DEL.NAME
AND TAB.LAST_NAME = DEL.LAST_NAME
AND TAB.USER_NAME = DEL.USER_NAME

Step 3. Insert records from the temp table

After that we are going to reinsert the rows that we stored in the temp table, by doing that we are going to get rid of the duplicate rows and only store unique records on this table.

INSERT INTO SANDBOX.TEST.USERS(NAME,LAST_NAME,CREATED_AT)
SELECT NAME
      ,LAST_NAME
      ,USER_NAME
      ,CREATED_AT
FROM SANDBOX.TEST.DELDUP_USERS;

As you noticed we deleted 6 records from this table which included the duplicate and non duplicate rows and then we inserted back 3 unique rows into the table.

By using this method you can easily get rid of duplicate records on your tables.


Leave a Reply

Your email address will not be published. Required fields are marked *