I'm not good at regex. Here is my scenario:
I'm trying to extract some info from a webpage that contains several tables. Only some of the tables contain a unique URL (let's say "very/unique.key"), so the page looks like this:
<table ....>
(bunch of content)
</table>
<table ....>
(bunch of content)
</table>
<table ....>
(bunch of content + "very/unique.key" keyword)
</table>
<table ....>
(bunch of content)
</table>
<table ....>
(bunch of content + "very/unique.key" keyword)
</table>
So what I want is to extract the content of every table that contains the "very/unique.key" keyword. Here are the patterns that I have tried:
$pattern = "#<table[^>]+>((?!\<table)(?=very\/unique\.key).*)<\/table>#i";
This returns nothing to me....
$pattern = "#<table[^>]+>((?!<table).*)<\/table>#i";
This returns everything from table 1's opening tag <table...> to the last table's closing tag </table>, even with the (?!<table) condition...
I'd appreciate help from anyone who is willing to look at this, thanks.
--EDIT--
Here is the solution that I found using DOM to loop through every table
--My Solution--
$index = array();//indexes of all the table(s) that contain the keyword
$DOM = new DOMDocument();
$DOM->loadHTMLFile("http://uni.corp/sub/sub/target.php?key=123");
$xpath = new DOMXPath($DOM);
$tables = $DOM->getElementsByTagName("table");
for ($n = 0; $n < $tables->length; $n++) {
$rows = $tables->item($n)->getElementsByTagName("tr");
for ($i = 0; $i < $rows->length; $i++) {
$cols = $rows->item($i)->getElementsByTagName("td");
for ($j = 0; $j < $cols->length; $j++) {
$td = $cols->item($j); // grab the td element
$img = $xpath->query('./img',$td)->item(0); // grab the first direct img child element
if(isset($img) ){
$image = $img->getAttribute('src'); // grab the source of the image
echo $image;
if($image == "very/unique.key"){
echo $cols->item($j)->nodeValue, "\t";
if(!in_array($n, $index)){//record each matching table only once
$index[] = $n;
}
echo count($index) . " " . $n;//for troubleshooting
}
}
}
echo "<br/>";
}
}
//loop that echoes only the table(s) that contain the keyword
$loop = sizeof($index);
for ($n = 0; $n < $loop; $n++) {
$temp = $index[$n];
$rows = $tables->item($temp)->getElementsByTagName("tr");
for ($i = 0; $i < $rows->length; $i++) {
$cols = $rows->item($i)->getElementsByTagName("td");
for ($j = 0; $j < $cols->length; $j++) {
echo $cols->item($j)->nodeValue, "\t";
//process the extracted table content here
}
//echo "<br/>";
}
}
Personally, though, I'm still curious about the regex part and hope someone can find a regex pattern that solves this. Anyway, thanks to everyone who is helping/advising me on this (especially AbsoluteƵERØ).
Though I agree with the comments on your post, I will give a solution. If you wanted to replace the very/unique.key with something else, a working regex would look something like this:
#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU
The key here is to use the correct modifiers to make it work with your input string: s lets . match newlines, and U makes the quantifiers ungreedy by default. For more information on these modifiers, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
Now here's an example where I replace the very/unique.key with "foobar":
<?php
$string = "
<table ....>
(bunch of content)
</table>
<table ....>
(bunch of content)
</table>
<table ....>
bunch of content very/unique.key
</table>
<table ....>
(bunch of content)
</table>
<table ....>
blabla very/unique.key
</table>
";
$pattern = '#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU';
echo preg_replace($pattern, '<table$1>$3foobar$4</table>', $string);
?>
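This code prints exactly the same string but with the two occurrences of "very/unique.key" replaced by "foobar", just like we want. Note that a match can begin at an earlier table that lacks the key and run through the first table that has it, so this pattern suits replacing the key better than extracting individual tables.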
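For the extraction the question asks about, one possible approach is a "tempered" dot. The lookahead is re-checked before every character rather than only once after the opening tag, which is why the single (?!<table) in the question had no effect. This is a minimal sketch, assuming the tables are not nested and each one has a closing tag. The function name extractTablesWithKey is made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): returns the inner content
// of every non-nested <table> that contains the literal key.
function extractTablesWithKey($html, $key)
{
    $quoted = preg_quote($key, '#');
    // (?:(?!</?table\b).)* consumes one character at a time and refuses to
    // step over any <table or </table tag, so a match cannot leave its table.
    $pattern = '#<table\b[^>]*>((?:(?!</?table\b).)*?' . $quoted
             . '(?:(?!</?table\b).)*)</table>#is';
    preg_match_all($pattern, $html, $matches);
    return array_map('trim', $matches[1]);
}

$string = "
<table ....>
(bunch of content)
</table>
<table ....>
bunch of content very/unique.key
</table>
<table ....>
blabla very/unique.key
</table>
";

print_r(extractTablesWithKey($string, 'very/unique.key'));
```

On this sample, the first table is skipped and only the contents of the two tables containing the key should be returned.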
Though this solution could work, it's certainly neither the most efficient nor the easiest to work with. As Mehdi said in the comments, PHP has an extension made specifically to operate on XML (and therefore HTML).
Here's a link to the documentation of that extension: http://www.php.net/manual/en/intro.dom.php
Using that, you could easily go through each table element and find the ones that have the unique key.
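A minimal sketch of that idea, assuming (as in the question's own DOM code) that the key appears in the src of an img inside the table. The sample markup and the function name findTablesByImageSrc are invented for illustration.

```php
<?php
// Hypothetical helper: returns the markup of each <table> that has an
// <img> descendant whose src attribute contains $key.
function findTablesByImageSrc($html, $key)
{
    $previous = libxml_use_internal_errors(true); // tolerate sloppy HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumes $key contains no double quote, since it is embedded in the query.
    $nodes = $xpath->query('//table[.//img[contains(@src, "' . $key . '")]]');

    $found = array();
    foreach ($nodes as $table) {
        $found[] = $dom->saveHTML($table); // outer HTML of the matching table
    }
    return $found;
}

$html = "<table id='1'><tr><td>no key here</td></tr></table>
<table id='2'><tr><td><img src='very/unique.key'> wanted</td></tr></table>";

print_r(findTablesByImageSrc($html, 'very/unique.key'));
```

Letting XPath do the filtering avoids the nested row and cell loops. If the key lives in the text rather than in an attribute, the predicate could be contains(., "...") instead.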
This works in PHP 5. We parse the tables and then use preg_match() to check for the key. The reason to use a method like this is that HTML, unlike XML, does not have to be syntactically correct, so you may not actually have proper closing tags. Additionally, you may have nested tables, which would give you multiple results when trying to match opening and closing tags with regex. This way we only check for the key itself and not for the good form of the document being parsed.
<?php
$input = "<html>
<table id='1'>
<tr>
<td>This does not contain the key.</td>
</tr>
</table>
<table id='2'>
<tr>
<td>This does contain the unique.key!</td>
</tr>
</table>
<table id='3'>
<tr>
<td>This also contains the unique.key.</td>
</tr>
</table>
</html>";
$html = new DOMDocument;
$html->loadHTML($input);
$findings = array();
$tables = $html->getElementsByTagName('table');
foreach($tables as $table){
$element = $table->nodeValue;
if(preg_match('!unique\.key!',$element)){
$findings[] = $element;
}
}
print_r($findings);
?>
Output
Array
(
[0] => This does contain the unique.key!
[1] => This also contains the unique.key.
)
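With nested tables, the loop above would report an outer table as well as the inner one, because an element's nodeValue includes the text of its descendants. One way to keep only the innermost match is an XPath filter. This is only a sketch: the sample markup and the function name innermostTableIds are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: ids of the innermost tables whose text contains $key.
function innermostTableIds($html, $key)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // Keep a table only if no table nested inside it also contains the key.
    // Assumes $key contains no double quote, since it is embedded in the query.
    $test  = 'contains(., "' . $key . '")';
    $nodes = $xpath->query('//table[' . $test . ' and not(.//table[' . $test . '])]');

    $ids = array();
    foreach ($nodes as $table) {
        $ids[] = $table->getAttribute('id');
    }
    return $ids;
}

$html = "<table id='outer'><tr><td>
<table id='inner'><tr><td>This contains the unique.key.</td></tr></table>
</td></tr></table>
<table id='plain'><tr><td>Nothing here.</td></tr></table>";

print_r(innermostTableIds($html, 'unique.key'));
```

Here only the table with id 'inner' should be reported. The outer table is dropped because a table nested inside it already contains the key.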