I'm fetching an HTML page with file_get_contents() and get a table like the one below, with more than 150 rows:
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol">
<a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">
asdxxx
</a>
</td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
I want to get the FIRST DATA, SECOND DATA, THIRD DATA and FOURTH DATA with a preg_match_all() call. I tried to write multiple patterns, but I couldn't succeed. Here's what I tried:
preg_match_all('/(<td class="tabcol tdmin_2l">|title=")(.*?)(<\/td>|")/s', $raw, $matches, PREG_SET_ORDER);
What are the correct patterns?
Try this:
$str = <<<HTML
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;
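// The s flag lets . match newlines, so cells that span several lines (as in the question) are captured too.
preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $str, $td_matches);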
preg_match('/ title="([^"]*)"/i', $td_matches[1][1], $title);
preg_match('/ href="([^"]*)"/i', $td_matches[1][1], $href);
echo $td_matches[1][0] . "\n";
echo $title[1] . "\n";
echo $href[1] . "\n";
echo $td_matches[1][3];
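The snippet above handles a single row. Since the question mentions more than 150 rows, here is a minimal sketch of how the same idea might be extended to loop over every row. The two-row sample data (A1, B1, ...) and the variable names are made up for illustration; in practice $raw would come from file_get_contents(). It assumes every row has the same four-cell layout as in the question.

```php
<?php
// Hypothetical two-row sample; real input would be the fetched page.
$raw = <<<HTML
<table>
<tr class="tabrow ">
<td class="tabcol tdmin_2l">A1</td>
<td class="tabcol">
<a class="modal-button" title="A2" href="A3" rel="x">
asdxxx
</a>
</td>
<td class="tabcol"></td>
<td class="tabcol">A4</td>
</tr>
<tr class="tabrow ">
<td class="tabcol tdmin_2l">B1</td>
<td class="tabcol">
<a class="modal-button" title="B2" href="B3" rel="x">
asdxxx
</a>
</td>
<td class="tabcol"></td>
<td class="tabcol">B4</td>
</tr>
</table>
HTML;

$rows = [];

// 1) Capture the inner HTML of each <tr class="tabrow ...">.
preg_match_all('/<tr class="tabrow[^"]*">(.*?)<\/tr>/is', $raw, $tr_matches);

foreach ($tr_matches[1] as $row_html) {
    // 2) Split the row into its cells.
    preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $row_html, $td);
    if (count($td[1]) < 4) {
        continue; // skip rows that do not have the expected layout
    }

    // 3) Pull title and href out of the second cell.
    if (!preg_match('/\stitle="([^"]*)"/i', $td[1][1], $title)) {
        continue;
    }
    if (!preg_match('/\shref="([^"]*)"/i', $td[1][1], $href)) {
        continue;
    }

    $rows[] = [trim($td[1][0]), $title[1], $href[1], trim($td[1][3])];
}

print_r($rows);
```

Rows that do not fit the expected shape are skipped rather than reported, which may or may not be what you want for real data.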
It does not answer your question directly, but it's the correct way to go.
You should avoid parsing HTML/XML content with regular expressions. Wonder why?
Fully parsing HTML with regular expressions is not possible, because it depends on matching each opening tag with its closing tag, and regular expressions cannot keep track of arbitrarily nested pairs.
Regular expressions can only match regular languages, but HTML is a context-free language. All you can do with regexps on HTML is apply heuristics, and those will not work in every case. For any given regular expression, it should be possible to construct an HTML file that it matches wrongly.
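To make the point concrete, here is a small sketch of a cell-matching heuristic that works on a flat cell but silently truncates a cell containing a nested table. The input strings are invented for illustration; the pattern is the same kind of lazy td matcher used in regex-based answers.

```php
<?php
// A typical lazy "match one cell" heuristic.
$pattern = '/<td[^>]*>(.*?)<\/td>/is';

// Case 1: a flat cell. The heuristic captures the full content.
$flat = '<td class="tabcol">FOURTH+DATA</td>';
preg_match($pattern, $flat, $m1);
$flat_result = $m1[1];

// Case 2: a cell that contains a nested table. The lazy match stops
// at the first </td>, which belongs to the inner cell, so the outer
// cell's content is cut off.
$nested = '<td class="tabcol">outer <table><tr><td>inner</td></tr></table> tail</td>';
preg_match($pattern, $nested, $m2);
$nested_result = $m2[1];

var_dump($flat_result);   // "FOURTH+DATA"
var_dump($nested_result); // "outer <table><tr><td>inner"
```

Making the quantifier greedy does not fix this either: with several cells in the input, it would then run through to the last closing tag.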
Use a DOM parser instead. Here's a glimpse of what it's like:
composer require symfony/dom-crawler symfony/css-selector
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;
$crawler = new Crawler($html);
$first = $crawler->filter('.tabcol.tdmin_2l')->text();
$second = $crawler->filter('.tabcol:nth-child(2) a')->attr('title');
$third = $crawler->filter('.tabcol:nth-child(2) a')->attr('href');
$fourth = $crawler->filter('.tabcol:nth-child(4)')->text();
var_dump($first, $second, $third, $fourth);
// Outputs:
// string(10) "FIRST+DATA"
// string(11) "SECOND+DATA"
// string(10) "THIRD+DATA"
// string(11) "FOURTH+DATA"
Easier and cleaner, right?
Using such parsers, you have the ability to extract elements using XPath as well.
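As a rough illustration of the XPath route, here is a sketch that uses PHP's built-in DOMDocument and DOMXPath classes (the dom extension) instead of the Symfony component, so no Composer packages are needed. The wrapping table element, the array keys, and the omission of the rel attribute are choices made for this example, not part of the original page.

```php
<?php
// Sample row from the question, wrapped in <table>; rel attribute omitted for brevity.
$html = <<<HTML
<table>
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA" href="THIRD+DATA">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
</table>
HTML;

// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];

// Select every <tr> whose class list contains the token "tabrow".
$query = '//tr[contains(concat(" ", normalize-space(@class), " "), " tabrow ")]';
foreach ($xpath->query($query) as $tr) {
    // Paths below are relative to the current row.
    $link = $xpath->query('td[2]/a', $tr)->item(0);
    $rows[] = [
        'first'  => trim($xpath->evaluate('string(td[1])', $tr)),
        'second' => $link ? $link->getAttribute('title') : null,
        'third'  => $link ? $link->getAttribute('href') : null,
        'fourth' => trim($xpath->evaluate('string(td[4])', $tr)),
    ];
}

print_r($rows);
```

Because the query selects all matching rows, the same loop should cover a table with 150 rows or more without changes, provided each row keeps this cell layout.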