I have an HTML page like this:
<!DOCTYPE html>
<html>
....
<body>
<div class="list-news fl pt10 ">
Blue
</div>
<div class="list-news fl pt10 alternative">
Yellow
</div>
<div class="list-news fl pt10 ">
Red
</div>
<div class="list-news fl pt10 alternative">
Cyan
</div>
<div class="list-news fl pt10 ">
Black
</div>
<div class="list-news fl pt10 alternative">
White
</div>
</body>
</html>
I wrote a short piece of PHP code to get all the content I need:
preg_match_all('@<div class="list-news fl pt10 .*?">(.*?)<div class="list-news fl pt10 .*?">@s',$rs,$match);
This is the result:
[1] => Array
(
[0] => <div>Blue</div></div>
[1] => <div>Red</div></div>
[2] => <div>Black</div></div>
)
The result only contains the content of the <div class="list-news fl pt10 "> elements and not the content of the <div class="list-news fl pt10 alternative"> elements.
I could use str_replace to remove the alternative class, but without replacing that string, how can I get the content of every div whose class matches list-news fl pt10.*?
Thanks for any ideas.
A DOM approach (with a naive contains):
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$query = <<<'EOD'
//div[
contains(@class, 'list-news') and
contains(@class, 'fl') and
contains(@class, 'pt10')]
EOD;
$nodes = $xpath->query($query);
$results = array();
foreach ($nodes as $node) {
$results[] = trim($node->textContent);
}
print_r($results);
A regex approach (with a naive pattern):
preg_match_all('~<div class="list-news fl pt10\b[^>]+>\s*\K.*?(?=\s*</div>)~',
$html, $matches);
print_r($matches[0]);
Both approaches are a little naive: contains doesn't care about word boundaries or the order of the classes, and the regex pattern doesn't cope with the possible irregularities of HTML code.
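If the word-boundary problem matters, one common XPath 1.0 idiom is to pad the normalized class value with spaces and test for each class surrounded by spaces. This is only a sketch: the sample markup, including the third div that a naive contains would wrongly accept, is made up for illustration.

```php
<?php
// Sample markup assumed for illustration (not the original page).
$html = <<<'HTML'
<div class="list-news fl pt10 ">Blue</div>
<div class="list-news fl pt10 alternative">Yellow</div>
<div class="flat list-newsletter pt100">Ignored</div>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the normalized class value with spaces so that each class
// is only matched as a whole token, in any order.
$class = "concat(' ', normalize-space(@class), ' ')";
$query = "//div[contains($class, ' list-news ')"
       . " and contains($class, ' fl ')"
       . " and contains($class, ' pt10 ')]";

$results = array();
foreach ($xpath->query($query) as $node) {
    $results[] = trim($node->textContent);
}
print_r($results); // Blue and Yellow only
```

The third div contains the substrings list-news, fl and pt10, so the naive query would select it, while the padded version does not.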
The reason your pattern doesn't work is that you can't obtain overlapping matches. Since the first occurrence ends with <div class="list-news..., the next occurrence can't begin with the same <div class="list-news... that has already been matched.
Putting the last <div class="list-news... in a lookahead (?=...) (which is only a check, and whose content is not part of the match result) can be a way. However, it is simpler to use the closing tag </div>.
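To illustrate the difference, here is a rough comparison on a shortened version of the page from the question. The </body> alternative in the lookahead is my own assumption, added so that the last div (which has no following div) is matched too.

```php
<?php
// Shortened sample based on the page in the question.
$html = <<<'HTML'
<body>
<div class="list-news fl pt10 ">
Blue
</div>
<div class="list-news fl pt10 alternative">
Yellow
</div>
<div class="list-news fl pt10 ">
Red
</div>
</body>
HTML;

$open = '<div class="list-news fl pt10[^>]*>';

// The next opening tag is consumed, so it can't start the following match.
preg_match_all("@$open(.*?)$open@s", $html, $consumed);

// The next opening tag (or </body> after the last div) is only checked.
preg_match_all("@$open(.*?)(?=$open|</body>)@s", $html, $checked);

echo count($consumed[1]), "\n"; // 1
echo count($checked[1]), "\n";  // 3

// The captures still contain the closing </div>, so clean them up.
$texts = array_map('trim', array_map('strip_tags', $checked[1]));
print_r($texts); // Blue, Yellow, Red
```

The extra cleanup step is the reason why stopping at the closing tag is the simpler choice here.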
\K is used to remove all that has been matched before (on the left) from the match result.
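A small sketch of the effect of \K, using a one-line sample string made up for the purpose: the two patterns are identical except for \K.

```php
<?php
// One-line sample assumed for illustration.
$html = '<div class="list-news fl pt10 alternative"> Yellow </div>';

// Without \K the opening tag is part of the match result.
preg_match('~<div[^>]*>\s*.*?(?=\s*</div>)~', $html, $with_tag);

// With \K everything matched on its left is dropped from the result.
preg_match('~<div[^>]*>\s*\K.*?(?=\s*</div>)~', $html, $text_only);

echo $with_tag[0], "\n";  // <div class="list-news fl pt10 alternative"> Yellow
echo $text_only[0], "\n"; // Yellow
```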
A good compromise can be to extract all the div tags that have a class attribute, and then to check with a regex that the attribute value is really what you want before extracting and trimming the text content:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$query = '//div[@class]';
$nodes = $xpath->query($query);
$results = array();
foreach($nodes as $node) {
if ( preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
$node->getAttribute('class')) )
$results[] = trim($node->textContent); // append to the array instead of overwriting it
}
or without XPath:
$dom = new DOMDocument();
@$dom->loadHTML($html);
$divs = $dom->getElementsByTagName('div');
$results = array();
foreach($divs as $node) {
if ( $node->hasAttribute('class') &&
preg_match('~(?:\s|^)list-news\s+fl\s+pt10(?:\s|$)~',
$node->getAttribute('class')) )
$results[] = trim($node->textContent); // append to the array instead of overwriting it
}