I have two tables with list of urls fetched from different sources.
I want to find the common entries and put them in separate table.
This is what I'm doing:
EDIT: Should I strip the urls of "http://' and 'www.'
I want to know any other method, which is better and faster, using which I can perform the above.
I am using PHP + MySQL
MD5 is a little bit slow if you need real speed. Try MurmurHash
You should do the following transformations before hash calculation:
Try something like:
INSERT INTO table3 (SELECT url FROM table1, table2 WHERE table1.hash = table2.hash)
That's not a valid SQL-statement, but a nested query like that should read urls from table1 and table2 that match by their hash and put them in table3.
EDIT: If you want to sanitize your input urls (e.g. removing GET-variables), I'd do that before saving them to tabel1 and table2. I wouldn't remove http and www as "https://somesite" and "http://somesite" as well as "www.somesite.com" and "somesite.com" may have different content.
SELECT * FROM table1 WHERE hash IN (SELECT hash FROM table2)
You may probably also want to have a look at the concept of table joins.