删除重复记录而不创建临时表

I have a table with many duplicate records:

shop
ID     tax_id
1      10
1      10
1      11
2      10
2      12
2      10
2      10

I want to delete all duplicate records without creating a temporary table. After the update query, the table should look like:

shop
ID     tax_id
1      10
1      11
2      10
2      12

Here's an in-place solution (but not one-liner)

Find out max id:

select max(id) as maxid 
  from shop;

Remember this value. Let's say it equals to 1000;

Re-insert unique values, with offset:

insert into shop (id, tax_id) 
select distinct id + 1000, tax_id 
  from shop;

Drop old values:

delete from shop
  where id <= 1000;

Restore normal ids:

update shop
  set id = id - 1000;

PROFIT!

First off, you can prevent this by creating a Unique Index on those two fields, for future reference.

As for the solution, create a new table shopnew with the same structure in mysql, or just delete every record from the table when the recordList is generated (make sure you have a backup!):

//Get every record from mysql
$sSQL = "Select ID, tax_id from shop";
$oRes = mysql_query($sSQL);
$aRecordList = array();
while($aRow = mysql_fetch_assoc($oRes)){
      //If record is a duplicate, it will be 'overwritten' 
      $aRecordList[$aRow['id'].".".$aRow['tax_id']] =1; 
}
//You could delete every record from shop here, if you dont want an additional table
//recordList now only contains unique records
foreach($aRecordList as $sRecord=>$bSet){
   $aExpRecord = explode(".",$sRecord);
   mysql_query("INSERT INTO shopnew set id=".$aExpRecord[0].", tax_id = ".$aExpRecord[1]
 }

Working solution.

//Sql query to find duplicates
SELECT id, tax_id, count(*) - 1 AS cnt 
  FROM shop 
  GROUP BY id
  HAVING cnt > 1

--- res

+------+--------+-----+
| id   | tax_id | cnt |
+------+--------+-----+
|    1 |     10 |   2 |
|    2 |     10 |   3 |
+------+--------+-----+


//Iterate through results with your language of choice
DELETE 
  FROM shop 
  WHERE id=<res id> 
    AND tax_id=<res tax_id> 
  LIMIT <cnt - 1>

---res (iterated)

+------+--------+
| id   | tax_id |
+------+--------+
|    1 |     10 |
|    1 |     11 |
|    2 |     12 |
|    2 |     10 |
+------+--------+

The two queries will require a small piece of php in order to carry out the deletes

$res = mysql_query("SELECT id, tax_id, count(*) - 1 AS cnt 
                      FROM shop 
                      GROUP BY id
                      HAVING cnt > 1")
while($row = mysql_fetch_assoc($res)){
    mysql_query("DELETE 
                   FROM shop 
                   WHERE id=".$row['id']."
                       AND tax_id=". $row['tax_id']."
                   LIMIT ".$row['cnt'] -1 . ");
}

Edit: Revisited this recently, for what it's worth, here's an alternative solution using a temporary column, removing the need for a scripting language.

ALTER TABLE shop ADD COLUMN place INT;

SET @i = 1

UPDATE shop SET place = @i:= @i + 1;

DELETE FROM shop WHERE place NOT IN (SELECT place FROM items GROUP BY id, tax_id);

ALTER TABLE shop DROP COLUMN place;

In reality the question with its current limitations is quite a tricky challenge. I thought about the solution the whole evening (understanding that the solution will never be of use). I would not use the solution in wild nature, I just tried to find out if it is possible to do using MySQL only.

The question in my formulation: is it possible to write a series of DELETE statements that will remove duplicate rows from a two-column table without unique constraints?

Problems:

  1. rows do not have an identity key or a primary key, so one should think up a way to refer to a single row that should stay
  2. we will need to group rows somehow, that is to apply an order and then condition, but the form of DELETE that supports ORDER BY can only have a WHERE clause and does not support HAVING. That is the order is applied after a condition is met.
  3. we would not need to sort rows if the values would be arranged by the clustered primary key, but we do not have one.

Suppose we have a table:

CREATE TABLE  `tablename` (
  `a_id` int(10) unsigned NOT NULL,
  `b_id` int(10) unsigned NOT NULL,
  KEY `Index_1` (`a_id`,`b_id`)
) ENGINE=InnoDB COLLATE utf8_bin;

I added a key (not UNIQUE or PRIMARY) to make lookups faster and hoping to use it in groupings.

You can feed the table with some values:

INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);

As a side effect, the key became a coverage index and when we make SELECTs from the table the values displayed are sorted, but when we make deletions the values are read in the order we inserted them.

Now, let's look at the following query:

SELECT @c, @a_id as a, @b_id as b, a_id, b_id
FROM tablename, (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) >= 1
;

And its result:

@c, a, b, a_id, b_id
 1, 1, 1,    1,    1
 2, 1, 1,    1,    1
 3, 1, 1,    1,    1
 1, 1, 4,    1,    4
 2, 1, 4,    1,    4
 3, 1, 4,    1,    4
 1, 2, 2,    2,    2
 2, 2, 2,    2,    2
 3, 2, 2,    2,    2
 1, 2, 3,    2,    3
 2, 2, 3,    2,    3
 3, 2, 3,    2,    3

The results are automatically sorted using Index_1, and duplicate pairs (a_id, b_id) are enumerated in column @c. That is our task now is to remove all rows where @c > 1. The only problem we have is to force MySQL use Index_1 on deletion which is rather tricky without applying an additional conditions. But we can do this by using an equality check or multiple equality checks on a_id:

DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (1)
  AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;

DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (2)
  AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;

SELECT * FROM tablename t;

a_id, b_id
   1,    1
   1,    4
   2,    2
   2,    3

I can not put all possible a_id in IN() because MySQL will understand that the index is useless in this case and the query will not remove all duplicates (only adjacent), but having say 10 different a_id I can remove duplicates in two DELETE statements, each IN will have 5 explicit ids.

Hope, this might be useful to someone =)

Maybe this can help:

$query="SELECT * FROM shop ORDER BY id";
$rez=$dbh->query($query);
$multi=$rez->fetchAll(PDO::FETCH_ASSOC);
foreach ($multi as $key=>$row){
$rest=array_slice($multi,$key+1);
foreach ($rest as $rest){
    if(($row['id']==$rest['id']) && ($row['tax_id']==$rest['tax_id'])){
        $dbh->query("DELETE FROM shop WHERE id={$rest['id']} and tax_id=     {$rest['tax_id']}");

    }
}

}

First foreach iterates every row, and second one doing comparation. I'm using PDO, but of course, you can do it in the procedural way.