I am using Simple HTML DOM parser to scrape data and ran into a question: how do I gather data contained in HTML5 microdata?
For example, <meta itemprop="title" content="Charlie and the Chocolate Factory">
How would I get both the itemprop and the content of meta properties using Simple HTML DOM parser?
Is the question how to iterate elements with an itemprop attribute? If so:
foreach ($doc->find('[itemprop]') as $el) {
    echo $el->itemprop . "\n";
    echo $el->content . "\n";
}
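For anyone who wants to try the same idea without installing Simple HTML DOM, here is a minimal sketch that swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath). The function name collectItemprops() and the sample markup are assumptions made for illustration, not part of either library.

```php
<?php
// Sketch: list itemprop/content pairs with PHP's bundled DOM extension
// (a swapped-in alternative to Simple HTML DOM).
// collectItemprops() is a hypothetical helper name.

function collectItemprops(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $pairs = [];
    // Select every element that carries an itemprop attribute.
    foreach ($xpath->query('//*[@itemprop]') as $el) {
        $pairs[] = [
            'itemprop' => $el->getAttribute('itemprop'),
            // getAttribute() returns '' when the attribute is absent.
            'content'  => $el->getAttribute('content'),
        ];
    }
    return $pairs;
}

// Sample markup taken from the question, wrapped in a minimal document.
$html = '<html><head>'
      . '<meta itemprop="title" content="Charlie and the Chocolate Factory">'
      . '</head><body></body></html>';

foreach (collectItemprops($html) as $pair) {
    echo $pair['itemprop'], ': ', $pair['content'], "\n";
}
```

The XPath expression //*[@itemprop] plays the same role as the [itemprop] selector above.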
You could try MicrodataPHP. I haven't been keeping up with changes in the spec, but it should cover your use case, and you can file issues if something is out of line with the current spec.
A nice way to do it is to loop over every itemprop element on the page and use a switch statement on its itemprop value, e.g.:
$line = []; // collected product fields, keyed by our own field names
foreach ($html->find('[itemprop]') as $productDetail) {
    switch ($productDetail->itemprop) {
        case 'image':
            $line['imageURL'] = $productDetail->src;
            break;
        case 'price':
            $line['price'] = $productDetail->plaintext; // note: plaintext, not content
            break;
        case 'name':
            $line['name'] = $productDetail->plaintext;
            break;
        case 'productId':
            $line['productId'] = $productDetail->content;
            break;
        case 'description':
            $line['description'] = $productDetail->content;
            break;
        case 'url':
            $line['url'] = $productDetail->content;
            break;
        default:
            break;
    }
}
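The mix of src, plaintext and content in the switch above reflects that the value of an itemprop lives in a different place depending on the element. As a rough sketch, the choice can also be made from the tag name, following the property value rules in the HTML microdata specification. This version uses PHP's bundled DOM extension rather than Simple HTML DOM; itempropValue() is a hypothetical helper, the sample markup and values are made up, and elements that also carry itemscope are not handled.

```php
<?php
// Sketch: choose where an itemprop value lives based on the tag name.
// itempropValue() is a hypothetical helper; nested items (itemscope) are ignored.

function itempropValue(DOMElement $el): string
{
    switch (strtolower($el->tagName)) {
        case 'meta':
            return $el->getAttribute('content');
        case 'audio':
        case 'embed':
        case 'iframe':
        case 'img':
        case 'source':
        case 'track':
        case 'video':
            return $el->getAttribute('src');
        case 'a':
        case 'area':
        case 'link':
            return $el->getAttribute('href');
        case 'object':
            return $el->getAttribute('data');
        case 'data':
        case 'meter':
            return $el->getAttribute('value');
        case 'time':
            return $el->hasAttribute('datetime')
                ? $el->getAttribute('datetime')
                : trim($el->textContent);
        default:
            // Trimming is a convenience added here, not part of the spec.
            return trim($el->textContent);
    }
}

// Made-up sample markup for illustration.
$html = '<html><head><meta itemprop="productId" content="42"></head>'
      . '<body><img itemprop="image" src="/cover.jpg">'
      . '<span itemprop="name"> Charlie and the Chocolate Factory </span>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$line = [];
foreach ((new DOMXPath($doc))->query('//*[@itemprop]') as $el) {
    $line[$el->getAttribute('itemprop')] = itempropValue($el);
}
print_r($line);
```

Keying on the tag name means a new itemprop name needs no new case, at the cost of losing the custom field names (such as imageURL) used in the switch.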
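Why use a parser for this job? PHP has a built-in function for reading meta tags, though it only returns tags that have a name attribute, so a meta tag that carries only itemprop (like the one in the question) will not show up in its result:
http://php.net/manual/en/function.get-meta-tags.php
get_meta_tags("url");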
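A quick way to see what get_meta_tags() returns for markup like the question's is to run it against a local file. This is only a sketch: the sample HTML, the author value and the temporary file are assumptions for illustration.

```php
<?php
// Sketch: check which meta tags get_meta_tags() reports.
// The markup below is made-up sample data.

$html = '<html><head>'
      . '<meta name="author" content="Example Author">'
      . '<meta itemprop="title" content="Charlie and the Chocolate Factory">'
      . '</head><body></body></html>';

// get_meta_tags() reads from a file name or URL, so write the sample to a temp file.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

$tags = get_meta_tags($file);
unlink($file);

// Only the tag with a name attribute is expected here; the itemprop-only tag is skipped.
print_r($tags);
```

So the function is handy for classic name/content meta tags, but microdata needs one of the DOM-based approaches in the other answers.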
You can try microdata-parser, a microdata parser library for PHP. You can feed it an HTML string directly, or you can call getDocument() on a Simple HTML DOM Parser instance to get a DOMDocument instance, feed that to microdata-parser, and get the output as an array, an object, or JSON.
Or, if you want to reinvent the wheel yourself, take a look at the W3C Microdata specification's section on converting microdata to JSON (the result could just as well be a PHP array or object if you don't encode it as JSON). Simply looking for itemprop attributes might not be the best solution if you want everything with the correct structure.
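To illustrate why structure matters, here is a heavily simplified sketch of grouping properties under their enclosing itemscope, loosely following the shape of the spec's JSON output (items, type, properties). It is not a full implementation of the spec's algorithm: itemref and itemid are ignored, values come from the content attribute or else the trimmed text, and all function names are hypothetical. It uses PHP's bundled DOM extension.

```php
<?php
// Sketch: nest itemprop values under their nearest itemscope ancestor.
// Simplified on purpose; all helper names are hypothetical.

function splitTokens(string $value): array
{
    // Split a space-separated attribute value into its tokens.
    return preg_split('/\s+/', trim($value), -1, PREG_SPLIT_NO_EMPTY);
}

function extractItem(DOMElement $scope): array
{
    $item = [
        'type'       => splitTokens($scope->getAttribute('itemtype')),
        'properties' => [],
    ];
    collectProperties($scope, $item['properties']);
    return $item;
}

function collectProperties(DOMElement $node, array &$props): void
{
    foreach ($node->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // skip text and comment nodes
        }
        $isScope = $child->hasAttribute('itemscope');
        if ($child->hasAttribute('itemprop')) {
            // A property that is itself an item becomes a nested array.
            $value = $isScope
                ? extractItem($child)
                : ($child->hasAttribute('content')
                    ? $child->getAttribute('content')
                    : trim($child->textContent));
            foreach (splitTokens($child->getAttribute('itemprop')) as $name) {
                $props[$name][] = $value;
            }
        }
        if (!$isScope) {
            // Descendants of a non-item element still belong to the current item.
            collectProperties($child, $props);
        }
    }
}

// Made-up sample markup with one nested item.
$html = '<html><body>'
      . '<div itemscope itemtype="https://schema.org/Book">'
      . '<span itemprop="name">Charlie and the Chocolate Factory</span>'
      . '<div itemprop="author" itemscope itemtype="https://schema.org/Person">'
      . '<span itemprop="name">Roald Dahl</span>'
      . '</div></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$result = ['items' => []];
// Top-level items: elements with itemscope that are not a property of another item.
foreach ((new DOMXPath($doc))->query('//*[@itemscope and not(@itemprop)]') as $scope) {
    $result['items'][] = extractItem($scope);
}

echo json_encode($result, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";
```

A flat loop over [itemprop] would report two name values here with no way to tell that one belongs to the book and the other to its author, which is the problem the nesting avoids.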