The SQL Guru Answers your Questions...
Is there a SQL statement that I can use to delete duplicate entries from a
data store while leaving a single distinct copy of each, that is, remove
all duplicates except one?
From your question, it is unclear whether your table has a unique key or
not. Since you refer to this as a "data store", I'm guessing that your
duplicates might be true duplicates, meaning that every value in every
column is identical. Let me first address the case where the table does not
have a unique key.
NO UNIQUE KEY
In this case, the problem is difficult to solve with a single SQL
statement, because nothing in the table distinguishes one copy of a row
from another. In this situation, I recommend one of the following
approaches:
1.) Add a unique key to the table
This is easy. Add a column called ID as an integer, and make it an
identity column by checking the Identity box in the table design window.
Set the Identity Seed to 1 and the Identity Increment to 1. The column will
automatically be populated with unique values for each row in the table.
Then proceed to the UNIQUE KEY section below.
2.) Write a stored procedure.
The strategy here would be to write a query that returns a row for each
set of duplicates, using a query such as the following:
SELECT Field1, Field2, Count(*)
FROM Foo1
GROUP BY Foo1.Field1, Foo1.Field2
HAVING Count(*) > 1
Use a cursor to loop through the returned rows, then for each set of
duplicates, read all rows for that set:
SELECT Field1, Field2
FROM Foo1
WHERE Field1 = @FIELD1 AND Field2 = @FIELD2
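Then delete each row except the first one returned, for each set of
duplicates. Because the rows in a set are identical, a WHERE clause on the
column values cannot single out one copy, so delete through the cursor
itself, for example with a positioned delete (DELETE ... WHERE CURRENT OF
the cursor).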
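The loop described above can be sketched outside a stored procedure as well. The following is a minimal illustration, not the stored procedure itself: it assumes Python's standard sqlite3 module as a stand-in for your database, a hypothetical helper name (remove_duplicates_without_key), made-up sample rows, and no NULLs in Field1 or Field2. SQLite's built-in rowid plays the part of the cursor position, since the table has no key of its own.

```python
import sqlite3


def remove_duplicates_without_key(conn):
    """Keep one row per (Field1, Field2) set; return how many rows were deleted."""
    # Step 1: one row per set of duplicates. COUNT(*) is used because
    # the table has no ID column to count.
    groups = conn.execute(
        "SELECT Field1, Field2, COUNT(*) FROM Foo1 "
        "GROUP BY Field1, Field2 HAVING COUNT(*) > 1"
    ).fetchall()

    deleted = 0
    for field1, field2, _count in groups:
        # Step 2: read every row in this set. rowid is SQLite's built-in
        # row locator, standing in for the cursor position (an assumption
        # of this sketch, not something the table design provides).
        rows = conn.execute(
            "SELECT rowid FROM Foo1 WHERE Field1 = ? AND Field2 = ?",
            (field1, field2),
        ).fetchall()
        # Step 3: delete each row except the first one returned.
        for (rowid,) in rows[1:]:
            conn.execute("DELETE FROM Foo1 WHERE rowid = ?", (rowid,))
            deleted += 1

    conn.commit()
    return deleted


def demo():
    """Build a small keyless table with true duplicates and clean it up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Foo1 (Field1 TEXT, Field2 TEXT)")
    conn.executemany(
        "INSERT INTO Foo1 VALUES (?, ?)",
        [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y"), ("c", "z"), ("c", "z")],
    )
    deleted = remove_duplicates_without_key(conn)
    remaining = conn.execute(
        "SELECT Field1, Field2 FROM Foo1 ORDER BY Field1, Field2"
    ).fetchall()
    conn.close()
    return deleted, remaining
```

Calling demo() on the six sample rows removes three of them and leaves one copy each of ("a", "x"), ("b", "y") and ("c", "z").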
UNIQUE KEY
If the table does have a unique key, removing duplicates is much easier
and can be accomplished in one SQL statement such as the following:
DELETE
FROM Foo1
WHERE Foo1.ID IN
      -- List 1 - all rows that have duplicates
      (SELECT F.ID
       FROM Foo1 AS F
       WHERE Exists (SELECT Field1, Field2, Count(ID)
                     FROM Foo1
                     WHERE Foo1.Field1 = F.Field1
                       AND Foo1.Field2 = F.Field2
                     GROUP BY Foo1.Field1, Foo1.Field2
                     HAVING Count(Foo1.ID) > 1))
  AND Foo1.ID NOT IN
      -- List 2 - one row from each set of duplicates
      (SELECT Min(ID)
       FROM Foo1 AS F
       WHERE Exists (SELECT Field1, Field2, Count(ID)
                     FROM Foo1
                     WHERE Foo1.Field1 = F.Field1
                       AND Foo1.Field2 = F.Field2
                     GROUP BY Foo1.Field1, Foo1.Field2
                     HAVING Count(Foo1.ID) > 1)
       GROUP BY Field1, Field2);
Since this may appear complicated, let me explain. My strategy here is to
return two lists: The first, List 1, is a list of all rows that have
duplicates, and the second, List 2, is a list of one row from each set of
duplicates. This query simply deletes all rows that are in List 1 but not
in List 2.
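To see the two lists at work, here is a small test harness. It is only a sketch: it assumes Python's standard sqlite3 module as a stand-in for your database, and the sample rows and the names DELETE_DUPLICATES and run_demo are made up for illustration. The DELETE statement itself is the one described above.

```python
import sqlite3

# The DELETE statement from the column: remove every row that is in
# List 1 (rows that have duplicates) but not in List 2 (one row per set).
DELETE_DUPLICATES = """
DELETE
FROM Foo1
WHERE Foo1.ID IN
      -- List 1 - all rows that have duplicates
      (SELECT F.ID
       FROM Foo1 AS F
       WHERE EXISTS (SELECT Field1, Field2, COUNT(ID)
                     FROM Foo1
                     WHERE Foo1.Field1 = F.Field1
                       AND Foo1.Field2 = F.Field2
                     GROUP BY Foo1.Field1, Foo1.Field2
                     HAVING COUNT(Foo1.ID) > 1))
  AND Foo1.ID NOT IN
      -- List 2 - one row from each set of duplicates
      (SELECT MIN(ID)
       FROM Foo1 AS F
       WHERE EXISTS (SELECT Field1, Field2, COUNT(ID)
                     FROM Foo1
                     WHERE Foo1.Field1 = F.Field1
                       AND Foo1.Field2 = F.Field2
                     GROUP BY Foo1.Field1, Foo1.Field2
                     HAVING COUNT(Foo1.ID) > 1)
       GROUP BY Field1, Field2);
"""


def run_demo():
    """Load sample rows with a unique ID, run the DELETE, return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Foo1 (ID INTEGER PRIMARY KEY, Field1 TEXT, Field2 TEXT)"
    )
    conn.executemany(
        "INSERT INTO Foo1 (ID, Field1, Field2) VALUES (?, ?, ?)",
        [
            (1, "a", "x"),  # first copy of ("a", "x"): kept (lowest ID in its set)
            (2, "a", "x"),  # duplicate
            (3, "b", "y"),  # unique row: in neither list
            (4, "a", "x"),  # duplicate
            (5, "c", "z"),  # first copy of ("c", "z"): kept
            (6, "c", "z"),  # duplicate
        ],
    )
    conn.execute(DELETE_DUPLICATES)
    conn.commit()
    remaining = conn.execute(
        "SELECT ID, Field1, Field2 FROM Foo1 ORDER BY ID"
    ).fetchall()
    conn.close()
    return remaining
```

With these sample rows, List 1 holds IDs 1, 2, 4, 5 and 6, List 2 holds IDs 1 and 5, and run_demo() returns the rows with IDs 1, 3 and 5.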
Tom