Very slow search and update database operations

I have a table "table1" which has almost 400,000 records. There is another table "table2" which has around 450,000 records.

I need to delete all the rows in table1 which are duplicated in table2. I have been trying to do it with PHP, and the script has been running for hours without finishing. Does it really take that much time?

The asin field is varchar(20) in table1.

The ASIN field is char(10) in table2, and it is indexed.

$duplicat = 0;
// fetch every asin from the asins table
$sql = "SELECT asin from asins";
$result = $conn->query($sql);

if ($result->num_rows > 0) {
    while($row = $result->fetch_assoc()) {
        $ASIN = $row['asin'];
        // check whether this asin also exists in asins_chukh
        $sql2 = "select id from asins_chukh where ASIN='$ASIN' limit 1";
        $result2 = $conn->query($sql2);
        if ($result2->num_rows > 0) {
            $duplicat++;
            // flag the matching row in asins as a duplicate
            $sql3 = "UPDATE `asins` SET `duplicate` = '1' WHERE `asins`.`asin` = '$ASIN';";
            $result3 = $conn->query($sql3);
            if($result3) {
                echo "duplicate = $ASIN <br/>";
            }
        }
    }
}

echo "totaal :$duplicat";

You can run one single SQL command instead of a loop, something like:

update table_2 t2 
set t2.duplicate = 1
where exists (
    select id 
    from table_1 t1 
    where t1.id = t2.id);

Warning! I didn't test the SQL above, so you may need to verify the syntax.

For this kind of database operation, using PHP to loop and join is never a good idea. Most of the time will be wasted on network data transfer between your PHP server and MySQL server.
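For example, adapted to the tables in the question (untested, and assuming the column names from your script: asins.asin, asins.duplicate, and asins_chukh.ASIN, with $conn being the same mysqli connection), the whole job can be handed to MySQL in a single query from PHP:

    // Untested sketch: flag every asins row whose asin also exists in asins_chukh.
    // $conn is the mysqli connection from the question's script.
    $sql = "UPDATE asins t1
               SET t1.duplicate = '1'
             WHERE EXISTS (SELECT 1
                             FROM asins_chukh t2
                            WHERE t2.ASIN = t1.asin)";

    if ($conn->query($sql)) {
        // affected_rows reports how many rows were flagged, no PHP loop required
        echo "total duplicates: " . $conn->affected_rows;
    } else {
        echo "update failed: " . $conn->error;
    }

One round trip to MySQL replaces the roughly 400,000 SELECTs (plus the UPDATEs) issued by the loop, so the network overhead mostly disappears.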

If even the above SQL takes too long, you can consider limiting the query to some range of rows. Something like:

update table_2 t2 
set t2.duplicate = 1
where t2.id > [range_start] and t2.id < [range_end]
  and exists (
    select id 
    from table_1 t1 
    where t1.id = t2.id);

This way, you can kick off several updates running in parallel.
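A rough, untested sketch of that idea in PHP: one mysqli connection per id range, using MYSQLI_ASYNC (available with the mysqlnd driver) so the ranged updates run concurrently on the server. The host, credentials, table names, and range boundaries below are placeholders:

    // Untested sketch: launch one ranged UPDATE per connection, in parallel.
    $ranges = [[0, 150000], [150000, 300000], [300000, 450000]];  // placeholder ranges
    $links  = [];

    foreach ($ranges as [$start, $end]) {
        $link = new mysqli('localhost', 'user', 'pass', 'mydb');  // placeholder credentials
        $sql  = "UPDATE table_2 t2
                    SET t2.duplicate = 1
                  WHERE t2.id >= $start AND t2.id < $end
                    AND EXISTS (SELECT 1 FROM table_1 t1 WHERE t1.id = t2.id)";
        $link->query($sql, MYSQLI_ASYNC);   // returns immediately; MySQL keeps working
        $links[] = $link;
    }

    // Wait for every asynchronous UPDATE to finish and report its row count.
    $pending = $links;
    while ($pending) {
        $read = $error = $reject = $pending;
        if (mysqli_poll($read, $error, $reject, 10) < 1) {
            continue;   // nothing finished yet, keep waiting
        }
        foreach (array_merge($read, $error) as $link) {
            $link->reap_async_query();
            echo "rows updated: " . $link->affected_rows . "\n";
            $pending = array_filter($pending, function ($l) use ($link) {
                return $l !== $link;
            });
        }
    }

Running the ranges from separate PHP processes (or a shell loop) works just as well; the point is simply that each connection gets its own non-overlapping slice of ids.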

Yes, processing RBAR (Row By Agonizing Row) is going to be slow. There is overhead associated with each of those individual SELECT and UPDATE statements that get executed:

- sending the SQL text to the database
- parsing the tokens for valid syntax (keywords, commas, expressions)
- validating the semantics (table references and column references valid, user has required privileges, etc.)
- evaluating possible execution plans (index range scan, full index scan, full table scan)
- converting the selected execution plan into executable code
- executing the query plan (obtaining locks, accessing rows, generating rollback, writing to the InnoDB and MySQL binary logs, etc.)
- returning the results

All of that takes time. For a statement or two, the time isn't that noticeable, but put thousands of executions into a tight loop, and it's like watching individual grains of sand falling in an hourglass.

MySQL, like most relational databases, is designed to efficiently operate on sets of data. Give the database work to do, and let the database crank, rather than spend time round tripping back and forth to the database.

It's like you've got a thousand tiny items to deliver, all to the same address. You can individually handle each item. Get a box, put the item into the box with a packing slip, seal the package, address the package, weigh the package and determine postage, affix postage, and then put it into the car, drive to the post office, drop the package off. Then drive back, and handle the next item, put it into a box, ... over and over and over.

Or, we could handle a lot of tiny items together, as a larger package, and reduce the amount of overhead work (time) spent on packaging and on round trips to and from the post office.


For one thing, there's really no need to run a separate SELECT statement, to find out if we need to do an UPDATE. We could just run the UPDATE. If there are no rows to be updated, the query will return an "affected rows" count of 0.

(Running the separate SELECT is like another round trip in the car to the post office, to check the list of packages that need to be delivered, before each round trip to the post office to drop off a package. Instead of two round trips, we can take the package with us on the first trip.)
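In the question's loop, that change might look something like this (untested; it reuses $conn and the question's table names, and folds the existence check into the UPDATE itself by joining to asins_chukh):

    // Untested sketch: one UPDATE per asin, no separate SELECT.
    // affected_rows is 0 when the asin isn't in asins_chukh (or is already flagged).
    $sql = "UPDATE asins t
              JOIN asins_chukh s
                ON s.ASIN = t.asin
               SET t.duplicate = '1'
             WHERE t.asin = '$ASIN'";
    if ($conn->query($sql) && $conn->affected_rows > 0) {
        $duplicat++;
        echo "duplicate = $ASIN <br/>";
    }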

So, that could improve things a bit. But it doesn't really get to the root of the performance problem.


The real performance boost comes from getting more work done in fewer SQL statements.

How would we identify ALL of the rows that need to be updated?

 SELECT t.asin
   FROM asins t
   JOIN asins_chukh s
     ON s.asin = t.asin
  WHERE NOT ( t.duplicate <=> '1' )

(If asin isn't unique, we need to tweak the query a bit, to avoid returning "duplicate" rows. The point is, we can write a single SELECT statement that identifies all of the rows that need to be updated.)

For non-trivial tables, for performance, we need to have suitable indexes available. In this case, we'd want indexes with a leading column of asin. If such an index doesn't already exist, we can create one, for example (the index name here is arbitrary):

 CREATE INDEX asins_chukh_ix ON asins_chukh (asin);

If that query doesn't return a huge number of rows, we can handle the UPDATE in one fell swoop:

 UPDATE asins t
   JOIN asins_chukh s
     ON s.asin = t.asin
    SET t.duplicate = '1' 
  WHERE NOT ( t.duplicate <=> '1' )

We need to be careful about the number of rows. We want to avoid holding blocking locks for a long time (impacting concurrent processes that may be accessing the asins table), and we want to avoid generating a huge amount of rollback.

We can break the work up into more manageable chunks.

(Referring back to the shipping tiny items analogy... if we have millions of tiny items, and putting all of those into a single shipment would create a package larger and heavier than a container ship container... we can break the shipment into manageably sized boxes.)

For example, we could handle the UPDATE in "batches" of 10,000 id values. Assuming id is unique (or nearly unique), is the leading column in the cluster key, and the id values are grouped fairly well into mostly contiguous ranges, we can get the update activity localized into one section of blocks, and not have to revisit most of those same blocks again...

The WHERE clause could be something like this:

  WHERE NOT ( t.duplicate <=> 1 )
    AND t.id >= 0 
    AND t.id <  0 + 10000

For the next batch...

  WHERE NOT ( t.duplicate <=> 1 )
    AND t.id >= 10000
    AND t.id <  10000 + 10000

Then

  WHERE NOT ( t.duplicate <=> 1 )
    AND t.id >= 20000
    AND t.id <  20000 + 10000

And so on, repeating that until we're past the maximum id value. (We could run a SELECT MAX(id) FROM asins as the first step, before the loop.)

(We want to test these statements as SELECT statements first, before we convert to an UPDATE.)
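A driver loop for those batches might look roughly like this in PHP (untested; it reuses the question's $conn, assumes asins.id is an integer key, and uses a hypothetical batch size of 10,000):

    // Untested sketch: flag duplicates in batches of 10,000 id values.
    $batchSize = 10000;

    $row   = $conn->query("SELECT MAX(id) AS max_id FROM asins")->fetch_assoc();
    $maxId = (int) $row['max_id'];

    for ($start = 0; $start <= $maxId; $start += $batchSize) {
        $end = $start + $batchSize;
        $sql = "UPDATE asins t
                  JOIN asins_chukh s
                    ON s.ASIN = t.asin
                   SET t.duplicate = '1'
                 WHERE NOT ( t.duplicate <=> '1' )
                   AND t.id >= $start
                   AND t.id <  $end";
        if ($conn->query($sql)) {
            echo "ids $start to $end: " . $conn->affected_rows . " rows flagged<br/>\n";
        } else {
            echo "batch starting at $start failed: " . $conn->error . "<br/>\n";
            break;
        }
    }

With autocommit on, each iteration commits its own small transaction, so locks are held only briefly and the rollback generated per statement stays small.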

Using the id column might not be the most appropriate way to create our batches.

Our objective is to create manageable "chunks" we can put into a loop, where the chunks don't overlap the same database blocks... so that we won't need to revisit the same block over and over, with multiple statements, to make changes to rows within the same block multiple times.