PHP + MYSQL存储URL,应使用哪种列类型和策略

For a webscraper I will store 5 million+ full urls in a table. I will need to check if the url does not exist in the table very often.

What column type and strategy is best for performance?

  • url with varchar and an index.

Or

  • second char column with an hash ( md5?) and check for that.

Or a completely different strategy ?

Second option. If you need to check if url doesn't exist, then a unique constraint on the hash of the URL is optimal method.

md5, even though dated, is ok for your use case as you have ~5 million rows, it's not likely you'd get a collision.

For the column, use BINARY(16). You can store UNHEX(MD5('your url here')); into BINARY(16), having fixed index length.

md5 is 128 bits, you don't need human-readable representation so you can save the "raw" binary which lets you cut the storage requirements in half - hence BINARY(16) instead of BINARY(32).

The maximum width of an index is 767 bytes - but IIRC there is no upper limit on the size of a URL. This is rather long even for a URL. OTOH, the longer the data, the more work the DBMS has to do in order to compare 2 values.

Hence using a hash is probably not required but will enhance performance.

Some mysql engines support an index type of 'hash' which avoids the need to create a new column (but I don't think this is supported on MyISAM and InnoDB). OTOH recent versions of MySQL (and MariaDB, PerconaDB) support "generated" or virtual columns. So you don't need to explicitly set the value for the hash when inserting/updating (but you would need to use table.hashed=MD5('$yourURL') in the WHERE clause to allow the DBMS to use the index.

Personally, if it were me, I'd be concerned about indexing the same page twice under different URLs - particularly where the query changes:

 http://www.example.com/?r=32323

and

 http://www.example.com/?r=51515

might refer to different content or may be the same - I'd also consider keeping a hash of the content.