I have the following two types of text:
Type one:
<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table>
</div>
</div>
Type two:
<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>
I'm using PHP with preg_match_all. I need a single expression that will return Officer One and Officer Two from the above. I'm using Corporate Officers</div> as the first anchor and </div> as the second, but I can't find Officer One inside all that table gibberish.
How do I return the text between anchor1 and anchor2 while ignoring everything inside any angle brackets <> in between?
I saw these threads but wasn't able to make their solutions work for me: RegEx: extract everything until X where X is not between two braces
About 80% of regex questions are about XML/HTML/XHTML, and about 75% of the answers are to not use a regex. Why? Because while a regex may seem to work for your example, it will be fragile and may break with a slight change in the input.
Please take a look at this beautiful tool. If you can't use it, then come back and we will help.
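The tool linked above is not named in this text, so purely as an illustration of the parser approach, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The sample input and the wrapping <div> around each block are assumptions based on the question, and the variable names are only illustrative.

```php
<?php
// Sample input assumed from the question: both markup variants, each in a wrapper <div>.
$html = <<<'HTML'
<div>
<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table>
</div>
</div>
<div>
<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them (e.g. for the stray </col>).
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find each "Corporate Officers" label, then its first following meta-data sibling.
$query = '//div[@class="meta-name"][normalize-space()="Corporate Officers"]'
       . '/following-sibling::div[@class="meta-data"][1]';

$officers = [];
foreach ($xpath->query($query) as $node) {
    // textContent concatenates all descendant text, so the table markup disappears.
    $officers[] = trim($node->textContent);
}

print_r($officers);
```

Because the parser builds a tree, the nested table in the first variant needs no special handling: the text of the meta-data node is read the same way in both cases.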
With SimpleXML:
$xml = new SimpleXMLElement('<div>
<div class="meta-name">
Corporate Officers
</div>
<div class="meta-data">
<table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171" />
<tbody>
<tr height="19">
<td width="171" height="19">
Officer One
</td>
</tr>
</tbody>
</table>
</div>
</div>
');
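$results = array();
// Walk the direct children of the root <div> and keep the two metadata blocks.
foreach ($xml->children() as $node) {
    if ($node->getName() == 'div') {
        $attributes = $node->attributes();
        // Cast to string so explode() receives plain text, not a SimpleXMLElement.
        $classes = explode(' ', (string) $attributes['class']);
        if (in_array('meta-name', $classes) || in_array('meta-data', $classes)) {
            $results[] = getText($node);
        }
    }
}
// Return the first non-empty text found in $node or, depth-first, in its descendants.
function getText($node) {
    $text = trim((string) $node);
    if (strlen($text) !== 0) {
        return $text;
    }
    foreach ($node->children() as $child) {
        // Compare against null so a text value such as "0" is not skipped.
        if (($text = getText($child)) !== null) {
            return $text;
        }
    }
    return null;
}
var_dump($results);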
As a general rule of thumb, never use Regex to parse HTML.
Try this regex:
'~<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>(?:<(?!/?div\b)[^>]*>|\s+)*\K[^<]+~'
This is based on the assumption that there's no other text content in the HTML between the opening <div> tags and the names you're looking for. The first part is self-explanatory:
<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>
I'm assuming the "Corporate Officers" text is sufficient to locate the starting point, but you can reinsert the class attributes if necessary. After that,
(?:<(?!/?div\b)[^>]*>|\s+)*
...consumes any number of tags other than <div> or </div> tags, along with any intervening whitespace. Then \K comes along and says forget all that, the real match starts here. [^<]+ consumes everything up to the beginning of the next tag, and that's all you see in the match results. It's as if everything before the \K was really a positive lookbehind, but without all the restrictions.
Here's a demo.
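The linked demo is not reproduced in this text. As a stand-in, here is a minimal sketch of how the pattern might be used with preg_match_all. The sample input is assumed from the two markup variants in the question, and the variable names are illustrative.

```php
<?php
// Sample input assumed from the question: both markup variants back to back.
$html = <<<'HTML'
<div class="meta-name">Corporate Officers</div>
<div class="meta-data"><table border="0" cellspacing="0" cellpadding="0" width="171">
<col width="171"></col>
<tbody>
<tr height="19">
<td width="171" height="19">Officer One</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="meta-name">Corporate Officers</div>
<div class="meta-data">Officer Two</div>
</div>
HTML;

// Single quotes keep the backslashes intact for PCRE.
$pattern = '~<div\b[^>]*>Corporate\s+Officers</div>\s*<div\b[^>]*>(?:<(?!/?div\b)[^>]*>|\s+)*\K[^<]+~';

preg_match_all($pattern, $html, $matches);

// \K discards everything matched before it, so $matches[0] holds only the names.
$names = array_map('trim', $matches[0]);

print_r($names);
```

There is no capture group to index here: since \K resets the start of the reported match, the names come back in $matches[0]. The trim() call is only a precaution in case the real markup has whitespace after a name.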