So I'm creating a web crawler and everything works; I have only one problem.
With file_get_contents($page_data["url"]);
I get the content of a web page. The page is then scanned to check whether one of my keywords exists on it.
$find = $keywords; $str = file_get_contents($page_data["url"]);
if(strpos($str, $find) !== false) // strpos() returns 0 for a match at the very start, so compare against false
When I insert the data into the MySQL database, I only want the content of the div in which the keyword was found.
I know I have to use the DOM, but I'm new to DOMDocument.
I solved the problem with:
$doc = new DOMDocument();
$doc->loadHTML($str);
$xPath = new DOMXpath($doc);
$xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), '".strtoupper($keywords)."')]";
$elements = $xPath->query($xPathQuery);
if($elements->length > 0){
    foreach($elements as $element){
        // Each $element is a text node that contains the keyword
        print "Found: " .$element->nodeValue."<br />";
    }
}
I think there are some problems with your desired solution:
Usually you would use an XPath query to search a DOM tree, but I really don't know how to search for a node that has a child text node containing a specific keyword.
You might want to have a look at Lucene which offers you some search engine functionality. There are also some HTML parsers for Lucene which might be able to solve your problem.
EDIT: You might search for the nearest tag before the matched keyword and then search for the corresponding closing tag. But that might not actually be the closing tag of the parent DIV.
EDIT: I found a question about searching for text within a tag: How to match a text node then follow parent nodes using XPath. So you might try importing the whole HTML into SimpleXML or DOMDocument and then using XPath to search for the string and its parent DIV.
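The XPath idea above could be sketched roughly as follows. This is only an illustration, not tested against the asker's crawler: the function name findEnclosingDivs and the sample HTML are made up for the example, the match is case-sensitive, and the keyword is assumed not to contain a double quote.

```php
<?php
// Hypothetical helper: returns the text of the nearest enclosing <div>
// for every text node that contains $keyword (case-sensitive).
function findEnclosingDivs(string $html, string $keyword): array
{
    // Assumption: the keyword must not contain a double quote,
    // because it is embedded in a double-quoted XPath string literal.
    if (strpos($keyword, '"') !== false) {
        throw new InvalidArgumentException('Keyword must not contain a double quote');
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select matching text nodes, then step up to the closest <div> ancestor.
    $query = '//text()[contains(., "' . $keyword . '")]/ancestor::div[1]';

    $results = [];
    foreach ($xpath->query($query) as $div) {
        $results[] = trim($div->textContent);
    }
    return $results;
}

// Sample input, invented for this example.
$html = '<html><body><div id="a"><p>Nothing here</p></div>'
      . '<div id="b"><p>Cheap <b>flights</b> to Rome</p></div></body></html>';

print_r(findEnclosingDivs($html, 'flights'));
```

If the markup of the div is wanted rather than its plain text, $doc->saveHTML($div) could be collected instead of $div->textContent.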
$str = file_get_contents($page_data["url"]);
if(strpos($str, $find) !== false) // !== false also catches a match at position 0
{
echo $page_data["referer_url"]. ' - found';
$keywords = $_POST['keywords'];
if($page_data["header"]){
echo "<table border='1' >";
echo "<tr><td width='300'>Status:</td><td width='500'> ".strtok($page_data["header"], "\n")."</td></tr>";}
else echo "<table border='1' >";
// PRINT FIRST LINE
echo "<tr><td>Page requested:</td><td> ".$page_data["url"]."</td></tr>";
// PRINT WEBSITE STATUS
// PRINT WEB PAGE
echo "<tr><td>Referer-page:</td><td> ".$page_data["referer_url"]."</td></tr>";
// CONTENT RECEIVED?
if ($page_data["received"]==true)
echo "<tr><td>Content received: </td><td>".round($page_data["bytes_received"] / 1024, 2) . " Kbytes</td></tr></table>"; // bytes to kilobytes
else
{
echo "<tr><td>Content:</td><td>Not received</td></tr></table>";
}
$domain = $_POST['domain'];
$link = mysql_connect('localhost', 'crawler', 'password');
if (!$link)
{
die('Could not connect: ' . mysql_error());
}
mysql_select_db("crawler");
if(empty($page_data["referer_url"]))
$page_data["referer_url"] = $page_data["url"];
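// strip_tags() returns the cleaned string; it does not modify $str in place
$str = strip_tags($str, '<p><b>');
// Escape every value that goes into the query, not only the page content
mysql_query("INSERT INTO crawler (id, domain, url, keywords, data) VALUES ('', '".mysql_real_escape_string($page_data["referer_url"])."', '".mysql_real_escape_string($page_data["url"])."', '".mysql_real_escape_string($keywords)."', '".mysql_real_escape_string($str)."' )");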
echo "<br><br>";
echo str_pad(" ", 5000); // "Force flush", workaround
flush();
}
As you can see, I can already find the keywords; now I need the part around them. Somebody told me I have to read the page into a tree structure, after which I can use the element surrounding the keyword I found (div, p, etc.).
Maybe this will help in a general way. The code will find all elements that have both an 'id' attribute and text containing "keyword", then display the 'id' value and the text value of the element (assumes the document is well-formed):
$sxml = new SimpleXMLElement(file_get_contents($page_data['url']));
foreach ($sxml->xpath('//div[@id]') as $div) {
if (strpos((string) $div, 'keyword') !== false) {
echo $div->attributes()->id . ': ' . trim($div) . "\n";
}
}
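Since crawled pages are often not well-formed, new SimpleXMLElement() may throw on them. One possible workaround, sketched here under assumptions, is to parse with DOMDocument::loadHTML() first and hand the result to SimpleXML via simplexml_import_dom(). The helper name divsWithKeyword and the sample markup are invented for the example. It also reads textContent so that text inside nested tags is searched, and matches case-insensitively with stripos().

```php
<?php
// Hypothetical helper, not part of the original answer: maps the id of each
// <div id="..."> whose text contains $keyword (case-insensitive) to that text.
function divsWithKeyword(string $html, string $keyword): array
{
    // Suppress warnings about sloppy markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Continue with the SimpleXML API used in the answer above.
    $sxml = simplexml_import_dom($doc);

    $matches = [];
    foreach ($sxml->xpath('//div[@id]') as $div) {
        // textContent includes the text of nested elements,
        // whereas (string) $div only yields the div's own direct text.
        $text = dom_import_simplexml($div)->textContent;
        if (stripos($text, $keyword) !== false) {
            $matches[(string) $div['id']] = trim($text);
        }
    }
    return $matches;
}

// The unclosed <p> and <br> are not well-formed XML, but loadHTML() accepts them.
$html = '<div id="intro"><p>Welcome<br>to the site</div>'
      . '<div id="offer"><p>Special KEYWORD offer</div>';

print_r(divsWithKeyword($html, 'keyword'));
```

With nested divs that both carry an id, the outer and the inner div would each be reported, so some de-duplication might be needed depending on the pages being crawled.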