Are there any efficient testing approaches you can suggest for verifying data after it has been parsed from HTML via PHP into SQL?
To give context, I am migrating data from sequentially numbered HTML pages (each containing a single table) into a MySQL table. DOMDocument and XPath are being used to extract the data DAO-style, and the output seems consistent. What would be the best way to check the database against the HTML (random picks, sequential checks, some programmatic approach...)?
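For reference, the extraction looks roughly like this. The file name, database credentials, and the two-column table layout are simplified placeholders, not my real schema:

```php
<?php
// Simplified sketch of the extraction; names are placeholders.
$dom = new DOMDocument();
libxml_use_internal_errors(true);        // tolerate messy real-world HTML
$dom->loadHTMLFile('page-0001.html');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows  = [];
foreach ($xpath->query('//table[1]//tr') as $tr) {
    $cells = [];
    foreach ($xpath->query('.//td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    if (count($cells) === 2) {           // assumes two data columns per row
        $rows[] = $cells;
    }
}

$pdo  = new PDO('mysql:host=localhost;dbname=migration', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO data_rows (col_a, col_b) VALUES (?, ?)');
foreach ($rows as $cells) {
    $stmt->execute($cells);
}
```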
Perhaps you could use a diff algorithm to compare the original HTML to the parsed text and calculate a similarity percentage. It would obviously never be a 100% match due to HTML tags and the like, but you can work out an acceptable range and test your data that way.
I think random sampling would be best unless you have the time and processing power to test everything.
Here is a PHP implementation of a diff algorithm: http://paulbutler.org/archives/a-simple-diff-algorithm-in-php/
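As a lighter-weight stand-in for a full diff, PHP's built-in `similar_text()` can produce the percentage directly. A minimal sketch, where the sample inputs and the 95% threshold are arbitrary placeholders you would calibrate against your own data:

```php
<?php
// Compare the text content of the source HTML with the imported value
// and flag anything below a tolerance you calibrate yourself.
function similarityPercent(string $html, string $parsedText): float
{
    // Replace tags with spaces (so adjacent cells don't fuse together),
    // then collapse whitespace before comparing.
    $plain = trim(preg_replace('/\s+/', ' ', preg_replace('/<[^>]+>/', ' ', $html)));
    $clean = trim(preg_replace('/\s+/', ' ', $parsedText));

    similar_text($plain, $clean, $percent);
    return $percent;
}

$html       = '<tr><td>Alice</td><td>42</td></tr>';  // sample source row
$parsedText = 'Alice 42';                            // sample imported value
$percent    = similarityPercent($html, $parsedText);
if ($percent < 95.0) {                               // acceptable range is yours to tune
    error_log("Possible import problem: only {$percent}% similar");
}
```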
Because you don't have access to the raw data, only the HTML, all you can really do is run exactly the same extraction twice and compare the results.
You could also create a new DOM document from your extracted data and compare it against the original DOM. That way you could catch data that was somehow imported incorrectly.
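A minimal sketch of the first idea: re-run the same DOMDocument/XPath extraction, then compare its rows against what actually landed in MySQL. The table and column names (`data_rows`, `page_no`, `col_a`, `col_b`) are placeholders for whatever your schema really uses:

```php
<?php
// $reExtractedRows is the output of running the same extraction routine
// a second time: an array of [col_a, col_b] arrays, in page order.
function verifyPage(PDO $pdo, array $reExtractedRows, int $pageNo): bool
{
    $stmt = $pdo->prepare(
        'SELECT col_a, col_b FROM data_rows WHERE page_no = ? ORDER BY id'
    );
    $stmt->execute([$pageNo]);
    $stored = $stmt->fetchAll(PDO::FETCH_NUM);

    // Strict comparison: same number of rows, same cell values, same order.
    return $reExtractedRows === $stored;
}
```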
But all of these methods are only as reliable as the extraction method itself, and probably not worth the server load of testing every import.
A random test would have a very low success rate at finding errors; you are probably better off relying on a human eye.
You could at least build some kind of probabilistic check that notices strange behavior.
For example, if you parse a daily news HTML page and on a particular day you only get 3 news items when the average per page is around 10, that should raise a flag. You could of course tweak these margins.
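A sketch of such a check, assuming you keep a row count per imported page; the 50% margin is arbitrary and would need tuning to your data:

```php
<?php
// Flag pages whose imported row count deviates too far from the average.
// $countsPerPage is a hypothetical array of page => number of rows imported.
function flagOutliers(array $countsPerPage, float $margin = 0.5): array
{
    if ($countsPerPage === []) {
        return [];
    }

    $average = array_sum($countsPerPage) / count($countsPerPage);
    $flagged = [];
    foreach ($countsPerPage as $page => $count) {
        if (abs($count - $average) > $average * $margin) {
            $flagged[$page] = $count;   // e.g. 3 items when ~10 is normal
        }
    }
    return $flagged;
}

// flagOutliers(['page-0001' => 11, 'page-0002' => 3, 'page-0003' => 9])
// would flag only page-0002.
```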