How to delete duplicate rows in Sql Server

As database administrators, analyst or data engineers is typical to find duplicate data in our datasets.


This can happen for multiple reasons, sometimes is because someone apply wrongly a join or even because software connected to database send two or more times same content to our tables.

For this example we are going to create a table to show you how to delete duplicate rows.

CREATE TABLE [dbo].[paymentMethods](
	[PaymentMethodName] [nvarchar](50) NULL,
	[LastEditedBy] varchar(50) NULL,
	[EditDate] date NULL
) ON [USERDATA]
GO

This table contains 3 columns with no constraint on them.
After table is created we have to add 3 rows with the following information.

INSERT INTO paymentMethods(PaymentMethodName,LastEditedBy,EditDate)
values('Cash','Alvaro','20180202')
	 ,('Credit-Card','Robert','20180401')
	 ,('Debit-Card','John','20180601')

As you can see each row is unique.

Now we are going to clone each row once by using the following query.

INSERT INTO paymentMethods(PaymentMethodName,LastEditedBy,EditDate)
SELECT PaymentMethodName
	  ,LastEditedBy
	  ,EditDate 
FROM paymentMethods

As you noticed, each row is duplicated once

Method 1

This method only works if duplicate rows contains exactly the same data.

If you want to remove duplicate values from your table follow this steps.

    1. Create a temp table with duplicate values filtered by using DISTINCT clause. By using DISTINCT on SELECT statement our query is going ignore duplicate values on paymentMethods. table.
SELECT DISTINCT PaymentMethodName
			  , LastEditedBy
			  , EditDate
INTO ##TMP_paymentMethods
FROM paymentMethods
    1. Truncate table where you want to store distinct values.
TRUNCATE TABLE paymentMethods
    1. Insert into truncated table values stored on temp table
INSERT INTO paymentMethods(PaymentMethodName, LastEditedBy,EditDate)
SELECT PaymentMethodName
	 , LastEditedBy
	 , EditDate 
FROM ##TMP_paymentMethods

Now table does not contain duplicates

This method was design to show you how to use DISTINCT clause. I don’t really recommend this method, but it can work fine for small tables.

Method 2

This method is more effective and it does not require to truncate the table but it requires more querying knowledge than the other method

For this method we are going to use CTE(Common Table expressions) and Row_Number() function.
If you don’t know how to use them you can find information on the following links ROW NUMBER and Common Table Expressions

For this example we are going to execute the following query 2 times

INSERT INTO paymentMethods(PaymentMethodName,LastEditedBy,EditDate)
SELECT PaymentMethodName
	  ,LastEditedBy
	  ,EditDate 
FROM paymentMethods


This is going to fulfill our table with 3 additional copies of our dataset.
If you want to know how many times you have a value in your table all you have to do is a Group By.

SELECT PaymentMethodName
	 , LastEditedBy
	 , EditDate 
	 , COUNT(*) num_rep
FROM paymentMethods
GROUP BY PaymentMethodName
	 , LastEditedBy
	 , EditDate 

Each Value is repeated 4 times

If you want to delete repeated values and you don’t want to repopulate the table, follow this steps.

    1. Start by numerating columns in our table by using Row_Number function
      SELECT PaymentMethodName
      ,LastEditedBy
      ,EditDate 
      ,ROW_NUMBER() 
      	OVER(PARTITION BY PaymentMethodName, LastEditedBy ORDER BY EditDate) as rn
      ,ROW_NUMBER() 
      	OVER(PARTITION BY PaymentMethodName, LastEditedBy,EditDate ORDER BY PaymentMethodName) as rn_parttot
      FROM paymentMethods
      

      Result is

      As you noticed both [rn] and [rn_parttot] columns contains the same information, even though they have different partition and order by clauses.
      This happens because duplicate rows contains exactly the same data. For this case i recommend you to partition your ROW_NUMBER by PaymentMethodName and LastEditedBy, because you may be able to find that different [PaymentMethodName] are edited by different users [LastEditedBy].

    2. delete rn(Row_Number) greater than 1.
      WITH C AS(
      SELECT PaymentMethodName
      ,LastEditedBy
      ,EditDate 
      ,ROW_NUMBER() 
      	OVER(PARTITION BY PaymentMethodName, LastEditedBy ORDER BY EditDate) as rn
      FROM paymentMethods
      )
      DELETE
      FROM C
      WHERE RN >1
      

      By deleting rn that are greater than 1 you are maintaining rn with value 1 and filtering the rest.

I hope this helps you when dealing with duplicate data


Leave a Reply

Your email address will not be published. Required fields are marked *