I have an HTML document in a PHP variable, $content. I can echo it, but I only need the <a ...> tags with class="pret". Once I have them, I need the code (e.g. d3852) from the href attribute of each <a> and the number (e.g. 2352.2345) from between <a> and </a>.
I have tried several examples from the web, but I get either empty arrays or PHP errors.
A regex example that gives me an empty array (the <a> tag is inside a table):
$pattern = "#<table\s.*?>.*?<a\s.*?class=[\"']pret[\"'].*?>(.*?)</a>.*?</table>#i";
preg_match_all($pattern, $content, $results);
print_r($results[1]);
Another example that just gives an error:
$a=$content->getElementsByTagName(a);
Reasons for the various errors: invalid HTML, non-UTF-8 characters.
Next I did this on another website, matched the contents in a single SQL table, and the result is a copied website with updated data from my country. No longer will I search the www for matching single results.
Assuming you're trying to parse a valid (or at least valid enough) HTML document, you should use DOM for this:
// Simple example adapted from the comments in the PHP manual
$xml = new DOMDocument();
// $content already holds the HTML string; use loadHTMLFile($url) to read from a file or URL instead
$xml->loadHTML($content);
$links = array();
foreach ($xml->getElementsByTagName('a') as $link) {
    $links[] = array('url' => $link->getAttribute('href'),
                     'text' => $link->nodeValue);
}
Note the use of loadHTML (or loadHTMLFile) rather than load: the HTML loaders are simply more robust against errors. You may also set DOMDocument::recover (as suggested in a comment by hakre) so that the parser tries to recover from errors.
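For markup that is not quite valid, a minimal sketch of tolerant loading might look like the following. It assumes the HTML is already in a string; the $html sample and its contents are made up for illustration. libxml_use_internal_errors() keeps parser warnings out of the output so they can be inspected afterwards if needed.

```php
<?php
// Hypothetical sample: unclosed <tr>/<td> and a stray </b> inside the link.
$html = '<table><tr><td><a href="/item_d3852.htm" class="pret">2352.2345</b></a></table>';

// Collect libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->recover = true;    // ask the parser to keep going after errors
$doc->loadHTML($html);   // the HTML loader tolerates tag soup

$errors = libxml_get_errors();          // whatever the parser complained about, if anything
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

$links = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    $links[] = array('url'  => $link->getAttribute('href'),
                     'text' => trim($link->nodeValue));
}
print_r($links);
```

The $errors array can be logged or ignored; the point is only that a few broken tags no longer abort the parse or flood the page with warnings.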
Or you could use XPath (here's an explanation of the syntax):
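$xpath = new DOMXPath($xml); // $xml is the DOMDocument loaded above
$elements = $xpath->query("//a[@class='pret']");
$links = array();
if ($elements !== false) { // query() returns false for a malformed expression
    foreach ($elements as $element) {
        $links[] = array('url' => $element->getAttribute('href'),
                         'text' => $element->nodeValue);
    }
}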
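The exact test @class='pret' misses links that carry more than one class. A sketch of a token-based match, followed by pulling the number out of the href and out of the link text, could look like this. The sample markup is modeled on the link format shown in the question, and the two patterns (first run of digits in the href, first decimal number in the text) are assumptions about what the data looks like.

```php
<?php
// Hypothetical sample, modeled on the link format shown in the question.
$html = '<table><tr><td>'
      . '<a href="/word_word_34670_word_number.htm" class="pret big"><span>3340.3570 word</span></a>'
      . '<a href="/other.htm" class="pretty">ignored</a>'
      . '</td></tr></table>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match "pret" as a whole class token: "pret big" matches, "pretty" does not.
$query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' pret ')]";

$rows = array();
foreach ($xpath->query($query) as $a) {
    $row = array('code' => null, 'number' => null);
    // Assumption: the code is the first run of digits in the href.
    if (preg_match('/\d+/', $a->getAttribute('href'), $m)) {
        $row['code'] = $m[0];
    }
    // Assumption: the value is the first decimal number in the link text.
    if (preg_match('/\d+(?:\.\d+)?/', $a->textContent, $m)) {
        $row['number'] = $m[0];
    }
    $rows[] = $row;
}
print_r($rows);
```

Running the small regexes on the already isolated attribute and text is much less fragile than trying to capture everything with one pattern over the whole page.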
And for the case of invalid HTML, you may use a regexp like this:
$a1 = '\s*[^\'"=<>]+\s*=\s*"[^"]*"'; # Attribute with " - space tolerant
$a2 = "\s*[^'\"=<>]+\s*=\s*'[^']*'"; # Attribute with ' - space tolerant
$a3 = '\s*[^\'"=<>]+\s*=\s*\w*';     # Unquoted values - space tolerant
# [^'"=<>]*                          # Junk - I'm not inserting this into the regexp, but you may have to
$a = "(?:$a1|$a2|$a3)*";             # Any number of attributes
$class = 'class=([\'"])pret\\1';     # Using ?: carefully elsewhere is crucial for \\1 to work,
                                     # otherwise you can use ["']
$reg = "~<a{$a}\s*{$class}{$a}\s*>(.*?)</a~si"; # s lets . span newlines; the link text is group 2
And then just call preg_match_all. All the regexps are written off the top of my head, so you may have to debug them.
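To show how the pieces fit together, here is a sketch that runs the assembled pattern through preg_match_all. The ~ delimiters, the s and i modifiers, the third alternative for unquoted values and the sample string are assumptions added so the snippet runs on its own. Because the quote around pret is capture group 1, the link text ends up in group 2.

```php
<?php
// Attribute sub-patterns, following the answer's approach.
$a1 = '\s*[^\'"=<>]+\s*=\s*"[^"]*"';   // name="value"
$a2 = "\s*[^'\"=<>]+\s*=\s*'[^']*'";   // name='value'
$a3 = '\s*[^\'"=<>]+\s*=\s*\w*';       // name=value (unquoted)
$a  = "(?:$a1|$a2|$a3)*";              // any number of attributes
$class = 'class=([\'"])pret\\1';       // group 1 = the quote character

// Assumed: ~ as delimiter, s so that . spans newlines, i for case-insensitivity.
$reg = "~<a{$a}\s*{$class}{$a}\s*>(.*?)</a~si";

// Hypothetical sample with mixed quoting and line breaks inside the link.
$content = "<td><a href='/word_word_34670_word_number.htm' class=\"pret\"><span>\n3340.3570 word\n</span></a></td>"
         . '<td><a class="other" href="/x.htm">skip me</a></td>';

preg_match_all($reg, $content, $results, PREG_SET_ORDER);

foreach ($results as $match) {
    // $match[2] is the inner HTML of the link; strip the <span> and whitespace.
    echo trim(strip_tags($match[2])), "\n";
}
```

With PREG_SET_ORDER each entry of $results is one matched link, which is usually easier to loop over than the default column-wise layout.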
I got the links like this:
preg_match_all('/<a[^>]*class="pret">(.*?)<\\/a>/si', $content, $links);
print_r($links[0]);
and the result is
Array(
[0] => <a href='/word_word_34670_word_number.htm' class="pret"><span>3340.3570 word</span></a>..........)
So I still need to get the first number inside the href and the number inside the span.