哪个是比较两个网址的最佳和最快的方法?

I have two tables with list of urls fetched from different sources.

I want to find the common entries and put them in separate table.

This is what I'm doing:

  1. find md5 hash of url while fetching them.
  2. Store them in a column.
  3. I fetch one table as an array, run a loop through it and insert the values from other table where md5 hash is the same.

EDIT: Should I strip the urls of "http://' and 'www.'

I want to know any other method, which is better and faster, using which I can perform the above.

I am using PHP + MySQL

MD5 is a little bit slow if you need real speed. Try MurmurHash

You should do the following transformations before hash calculation:

  • Strip "http://" and www.
  • Strip trailing slash
  • Normalize URL (urlencode it)

Try something like:

INSERT INTO table3  (SELECT url FROM table1, table2 WHERE table1.hash = table2.hash)

That's not a valid SQL-statement, but a nested query like that should read urls from table1 and table2 that match by their hash and put them in table3.

EDIT: If you want to sanitize your input urls (e.g. removing GET-variables), I'd do that before saving them to tabel1 and table2. I wouldn't remove http and www as "https://somesite" and "http://somesite" as well as "www.somesite.com" and "somesite.com" may have different content.

SELECT * FROM table1 WHERE hash IN (SELECT hash FROM table2)

You may probably also want to have a look at the concept of table joins.