find duplicate records to delete or update in sql server
This article is to describe different methods to find duplicate records and also remove those rows using different methods in SQL Server. Many times some records are considered duplicate by your business logic but SQL Server does not treat them as duplicate records which makes it difficult to delete. In my experience most of the time DBA wants to remove records for data cleaning purpose. Here are some guidelines and ways to remove duplicate entries from your database.
Before you remove duplicates
Before you decide to delete data make a backup copy of table.
Should test your script by creating a copy of table and verify the results.
After removing duplicate data try creating constraints on table to prevent future issues with duplicate entries.
Sample Table with duplicate values
CREATE TABLE users
(
username varchar(15),
pwd varchar(15)
)
-- First Insert
INSERT INTO users VALUES('user1','pwd1')
INSERT INTO users VALUES('user2','pwd2')
INSERT INTO users VALUES('user3','pwd3')
-- Second Insert creating duplicates
INSERT INTO users VALUES('user1','pwd4')
INSERT INTO users VALUES('user2','pwd5')
INSERT INTO users VALUES('user3','pwd6')
-- Third Insert with duplicate values
INSERT INTO users VALUES('user1','pwd7')
INSERT INTO users VALUES('user2','pwd8')
INSERT INTO users VALUES('user3','pwd9')
The above table have duplicate usernames, SQL Server does not consider it as duplicate values but business logic defines username is duplicated in users table.
SQL Server row duplication means all fields must match.
SELECT DISTINCT * FROM users
Finding duplicate values using Group by and Having clause.
SELECT username, count(*) AS [Count]
FROM users
GROUP BY username HAVING count(*) > 1
First Method to delete duplicate rows.
SET ROWCOUNT 1
SELECT @@rowcount
WHILE @@rowcount > 0
DELETE usr
FROM users as usr
INNER JOIN
(SELECT username
FROM users
GROUP BY username HAVING count(*) > 1)
AS u ON u.username = usr.username
SET ROWCOUNT 0
The second method could be by creating a copy of table with unique index with Ignore Duplicate Key clause, this method require extra space.
--Create blank table with same structure as original table.
SELECT * INTO UsersCopy FROM users WHERE 1 = 0
--Create unique index with ignore duplicate keys
CREATE UNIQUE INDEX u_username ON userscopy (username) WITH IGNORE_DUP_KEY
-- Insert data into copy table from original table
INSERT INTO userscopy SELECT * FROM users
-- Truncate / empty original table
TRUNCATE TABLE users
-- Insert the clean data back
INSERT INTO users SELECT * FROM userscopy
-- Drop the index and copy of table
DROP INDEX userscopy.u_username
DROP TABLE userscopy
Third method is using cursors to perform delete, this will work on SQL Server 2005 only. We can pass the number of rows to remove in TOP clause.
DECLARE @username varchar(15)
DECLARE @cnt int
DECLARE cur CURSOR READ_ONLY
FOR
SELECT username, count(*) -1 as cnt
FROM users
GROUP BY username HAVING Count(*) > 1
OPEN cur
FETCH cur INTO @username, @cnt
WHILE @@FETCH_STATUS = 0
BEGIN
DELETE TOP(@cnt) FROM users WHERE username = @username
FETCH cur INTO @username,@cnt
END
CLOSE cur
DEALLOCATE cur
The fourth method could be typing each delete statement, this will work on SQL Server 2005 only because SQL Server 2000 does not allow passing parameters in TOP clause.
DELETE TOP
(
SELECT COUNT(*) - 1
FROM users
WHERE username = 'User1'
)
FROM users
WHERE username = 'User1'