I want to scrape an HTML list structure so that I can save each parent and child category separately.
Here's the HTML source of the page:
<ul class="categories_list">
  <li><a href="/sports-nutrition">Sports Nutrition</a>
    <ul class="categories_list">
      <li><a href="/protein">Protein</a>
        <ul class="categories_list">
          <li><a href="/protein-powder">Protein Powder</a>
            <ul class="categories_list">
              <li><a href="/whey-protein">Whey Protein</a>
                <ul class="categories_list">
                  <li><a href="/whey-protein-isolate">Whey Protein Isolate</a></li>
                </ul>
              </li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
    <ul class="categories_list">
      <li><a href="/pre-workout-supplements">Pre Workout Supplements</a></li>
    </ul>
    <ul class="categories_list">
      <li><a href="/creatine">Creatine</a>
        <ul class="categories_list">
          <li><a href="/creatine-monohydrate">Creatine Monohydrate</a></li>
        </ul>
      </li>
    </ul>
    <ul class="categories_list">
      <li><a href="/amino-acids">Amino Acids</a>
        <ul class="categories_list">
          <li><a href="/essential-amino-acids">Essential Amino Acids</a>
            <ul class="categories_list">
              <li><a href="/bcaa">BCAA</a></li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
    <ul class="categories_list">
      <li><a href="/joint-supplements">Joint Supplements</a>
        <ul class="categories_list">
          <li><a href="/curcumin">Curcumin</a>
            <ul class="categories_list">
              <li><a href="/curcumin-phytosome">Curcumin Phytosome</a></li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
    <ul class="categories_list">
      <li><a href="/energy-endurance">Energy & Endurance</a>
        <ul class="categories_list">
          <li><a href="/stimulants">Stimulants</a></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
I am using Simple HTML DOM for scraping. I am able to get all the categories, but I cannot get them in the proper hierarchy. I also tried the children approach, but that didn't work.
So I am looking for some help to make my existing code work. Here's my existing code:
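$html = file_get_html($url);
// Matches every <li> at any depth, so the output is one flat list.
foreach ($html->find('ul.categories_list li') as $link) {
    // plaintext of an <li> also includes the text of all nested <li> elements
    echo $link->plaintext . '<br>';
}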
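One likely reason the loop above loses the hierarchy is that `ul.categories_list li` is a descendant selector, so it matches every `<li>` at every depth. The sketch below only illustrates the difference between descendant and direct-child matching. It uses PHP's built-in DOM extension (`DOMDocument`, `DOMXPath`) rather than Simple HTML DOM, purely so that it runs without extra files. The shortened markup in `$sample` is made up for illustration.

```php
<?php
// Hypothetical, shortened sample markup (not the real page).
$sample = '<ul class="categories_list">'
        . '<li><a href="/a">A</a>'
        . '<ul class="categories_list"><li><a href="/b">B</a></li></ul>'
        . '</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Descendant match: every <li> below any list, regardless of depth.
$allItems = $xpath->query('//ul[@class="categories_list"]//li');

// Direct-child match: only the <li> elements of the outermost list.
$topItems = $xpath->query('//body/ul[@class="categories_list"]/li');

echo $allItems->length, "\n"; // 2
echo $topItems->length, "\n"; // 1
```

The descendant query returns both items in one flat set, which is roughly what the `find()` call above does. The direct-child query returns only the top-level item, which is the starting point for walking the tree level by level.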
There is also this script, which tries to get all elements. It needs to be improved upon:
<?php
require_once("simple_html_dom.php");
$categoryList = array();
$dom = file_get_html("source.php");
getCategory($dom);
print_r($categoryList);
// Recursively collects every category link into the global $categoryList.
function getCategory(simple_html_dom $dom) {
    global $categoryList;
    // Note: this selector matches <li> elements at every depth, not only direct children.
    foreach ($dom->find('ul.categories_list li') as $li) {
        // extract the <a> tag if found
        $link = $li->find('a', 0);
        if ($link === null) {
            continue;
        }
        $categoryList[] = array(
            "categoryName"  => $link->href,
            "categoryLabel" => $link->innertext,
        );
        // remove the <a> node so that only the nested markup remains
        $link->outertext = '';
        $string = $li->innertext;
        if (trim($string) == '') {
            continue;
        }
        // parse the remaining markup and recurse into it
        $dom2 = str_get_html($string);
        getCategory($dom2);
    }
}
It basically recurses, appending to $categoryList on each call, but it does not record which category is the parent of which.
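To keep the hierarchy, the recursion can pass the current parent down with each call and look only at direct children. The following is a minimal sketch of that idea, not a drop-in fix: it uses PHP's built-in DOM extension instead of Simple HTML DOM, and the function names `collectCategories` and `childElements`, the `parent` key and the shortened `$sample` markup are all assumptions made for illustration.

```php
<?php
// Return the direct child elements of $node that have the given tag name.
function childElements(DOMNode $node, string $tag): array {
    $found = array();
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement && $child->tagName === $tag) {
            $found[] = $child;
        }
    }
    return $found;
}

// Walk one <ul>, recording each category together with its parent's href.
function collectCategories(DOMElement $ul, ?string $parent, array &$rows): void {
    foreach (childElements($ul, 'li') as $li) {
        $links = childElements($li, 'a');
        if (count($links) === 0) {
            continue; // <li> without a link: nothing to record
        }
        $href = $links[0]->getAttribute('href');
        $rows[] = array(
            'categoryName'  => $href,
            'categoryLabel' => trim($links[0]->textContent),
            'parent'        => $parent, // null for top-level categories
        );
        // Every nested <ul> directly inside this <li> holds its children.
        foreach (childElements($li, 'ul') as $childUl) {
            collectCategories($childUl, $href, $rows);
        }
    }
}

// Hypothetical, shortened sample markup modelled on the list in the question.
$sample = '<ul class="categories_list"><li><a href="/sports-nutrition">Sports Nutrition</a>'
        . '<ul class="categories_list"><li><a href="/protein">Protein</a>'
        . '<ul class="categories_list"><li><a href="/protein-powder">Protein Powder</a></li></ul>'
        . '</li></ul>'
        . '<ul class="categories_list"><li><a href="/creatine">Creatine</a></li></ul>'
        . '</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$body = $doc->getElementsByTagName('body')->item(0);

$rows = array();
foreach (childElements($body, 'ul') as $topUl) {
    collectCategories($topUl, null, $rows);
}

foreach ($rows as $row) {
    echo $row['categoryLabel'], ' <- ', $row['parent'] ?? '(none)', "\n";
}
```

Each row carries its own `parent` value, so parents and children can be saved separately and joined again later. With Simple HTML DOM the same shape should be achievable by iterating `$node->children()` and checking each child's `tag`, instead of calling `find()` with a descendant selector, though that variant is untested here.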