I am trying to extract railway ticket data for internal use.
The full data looks like this table.
I have extracted the content of every <td> with preg_match_all, but I cannot extract the coach position, as seen in this screenshot.
I have tried the code below:
<?php
$result='tables code over here which you can find in pastebin link';
preg_match_all('/<TD class="table_border_both"><b>(.*)<\/b><\/TD>/s',$result,$matches);
var_dump($matches);
?>
I get rubbish output like:
You can use the following regular expression:
$re = "/<TD class=\"table_border_both\"><b>([0-9][0-9])
<\/b><\/TD>/";
$str = "<table width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"1\" class=\"table_border\">
<tr>
<td colspan=\"9\" class=\"heading_table_top\">Journey Details</td>
</tr>
<TR class=\"heading_table\">
<td width=\"11%\">Train Number</Td>
<td width=\"16%\">Train Name</td>
<td width=\"18%\">Boarding Date <br>(DD-MM-YYYY)</td>
<td width=\"7%\">From</Td>
<td width=\"7%\">To</Td>
<td width=\"14%\">Reserved Upto</Td>
<td width=\"21%\">Boarding Point</Td>
<td width=\"6%\">Class</Td>
</TR>
<TR>
<TD class=\"table_border_both\">*12559</TD>
<TD class=\"table_border_both\">SHIV GANGA EXP </TD>
<TD class=\"table_border_both\"> 5- 7-2014</TD>
<TD class=\"table_border_both\">BSB </TD>
<TD class=\"table_border_both\">NDLS</TD>
<TD class=\"table_border_both\">NDLS</TD>
<TD class=\"table_border_both\">BSB </TD>
<TD class=\"table_border_both\"> SL</TD>
</TR>
</table>
<TABLE width=\"100%\" border=\"0\" cellpadding=\"0\" cellspacing=\"1\" class=\"table_border\" id=\"center_table\" >
<TR>
<td width=\"25%\" class=\"heading_table_top\">S. No.</td>
<td width=\"45%\" class=\"heading_table_top\">Booking Status <br /> (Coach No , Berth No., Quota)</td>
<td width=\"30%\" class=\"heading_table_top\">* Current Status <br />(Coach No , Berth No.)</td>
<td width=\"30%\" class=\"heading_table_top\">Coach Position</td>
</TR>
<TR>
<TD class=\"table_border_both\"><B>Passenger 1</B></TD>
<TD class=\"table_border_both\"><B>S1 , 33,CK </B></TD>
<TD class=\"table_border_both\"><B>S1 , 33</B></TD>
<TD class=\"table_border_both\"><b>11
</b></TD>
</TR>
<TR>
<TD class=\"table_border_both\"><B>Passenger 2</B></TD>
<TD class=\"table_border_both\"><B>S1 , 34,CK </B></TD>
<TD class=\"table_border_both\"><B>S1 , 34</B></TD>
<TD class=\"table_border_both\"><b>11
</b></TD>
</TR>
<TR>
<TD class=\"table_border_both\"><B>Passenger 3</B></TD>
<TD class=\"table_border_both\"><B>S1 , 36,CK </B></TD>
<TD class=\"table_border_both\"><B>S1 , 36</B></TD>
<TD class=\"table_border_both\"><b>11
</b></TD>
</TR>
<TR>
<TD class=\"table_border_both\"><B>Passenger 4</B></TD>
<TD class=\"table_border_both\"><B>S1 , 37,CK </B></TD>
<TD class=\"table_border_both\"><B>S1 , 37</B></TD>
<TD class=\"table_border_both\"><b>11
</b></TD>
</TR>
<TR>
<td class=\"heading_table_top\">Charting Status</td>
<TD colspan=\"3\" align=\"middle\" valign=\"middle\" class=\"table_border_both\"> CHART PREPARED </TD>
</TR>
<TR>
<td colspan=\"4\"><font color=\"#1219e8\" size=\"1\"><b> * Please Note that in case the Final Charts have not been prepared, the Current Status might upgrade/downgrade at a later stage.</font></b></Td>
</TR>
</table>";
preg_match_all($re, $str, $matches);
A useful website for testing regular expressions: http://regex101.com/
// PHP has no "g" modifier; preg_match_all() already returns every match.
$regexp = '/<td class="table_border_both"><b>(.*?)\s*<\/b><\/td>/i';
There is a line break inside the "Coach Position" <td>, and your regexp does not account for it. It is better to use \s*, so that the pattern still matches when there are spaces or line breaks before the closing </b>.
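To make the effect of \s* concrete, here is a minimal sketch. The two-cell $html string is a hypothetical sample modelled on the table in the question, and the variable names are chosen only for illustration.

```php
<?php
// Hypothetical sample: one ordinary cell and one cell with a line break
// before the closing </b>, as in the "Coach Position" column.
$html = "<TD class=\"table_border_both\"><B>S1 , 33</B></TD>\n"
      . "<TD class=\"table_border_both\"><b>11\n</b></TD>\n";

// Without \s*: "." does not match a newline (no /s modifier), so the
// cell whose content ends in a line break is skipped.
$strict = '/<td class="table_border_both"><b>(.*?)<\/b><\/td>/i';
preg_match_all($strict, $html, $strictMatches);

// With \s*: trailing whitespace, including the newline, is consumed
// outside the capture group, so both cells match.
$tolerant = '/<td class="table_border_both"><b>(.*?)\s*<\/b><\/td>/i';
preg_match_all($tolerant, $html, $tolerantMatches);

print_r($strictMatches[1]);   // one value: "S1 , 33"
print_r($tolerantMatches[1]); // two values: "S1 , 33" and "11"
```

The lazy (.*?) together with \s* also keeps trailing spaces out of the captured value.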
You know that the table has 4 columns, so the captured values can be split into rows:
$data = array_chunk($matches[1], 4); // $matches[1] holds the captured cell contents; split them into rows of 4
Now the rows are ready. A few more lines give each column a name:
$data = array_map(function (array $row) {
    return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
}, $data); // assign each column in the row its name
If we combine all the code, it will probably look like this:
$data = array_map(function (array $row) {
    return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
}, array_chunk($matches[1], 4)); // chunk the captured values (group 1), not the whole $matches array
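Putting the pieces together, the whole approach might look like the sketch below. The $html sample (two passenger rows copied in shape from the question's table) and the $rows variable are assumptions made for illustration; with the real page you would pass the fetched HTML instead.

```php
<?php
// Hypothetical sample: two passenger rows in the same shape as the question's table.
$html = <<<'EOT'
<TR>
<TD class="table_border_both"><B>Passenger 1</B></TD>
<TD class="table_border_both"><B>S1 , 33,CK </B></TD>
<TD class="table_border_both"><B>S1 , 33</B></TD>
<TD class="table_border_both"><b>11
</b></TD>
</TR>
<TR>
<TD class="table_border_both"><B>Passenger 2</B></TD>
<TD class="table_border_both"><B>S1 , 34,CK </B></TD>
<TD class="table_border_both"><B>S1 , 34</B></TD>
<TD class="table_border_both"><b>11
</b></TD>
</TR>
EOT;

// Capture the content of every bold cell, ignoring trailing whitespace.
$regexp = '/<td class="table_border_both"><b>(.*?)\s*<\/b><\/td>/i';
preg_match_all($regexp, $html, $matches);

// Group the flat list of captures into rows of 4, then name the columns.
$rows = array_map(function (array $row) {
    return array_combine(['snum', 'status_book', 'status_cur', 'position'], $row);
}, array_chunk($matches[1], 4));

print_r($rows);
```

This relies on every passenger row containing exactly 4 bold cells. If the number of matches is not a multiple of 4, the last chunk is shorter and array_combine will fail, so it may be worth checking the count first.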
\s+ is needed because there is whitespace (a line break) between the number and the closing </b>; without it the pattern will not match.
$data = file_get_contents("http://pastebin.com/raw.php?i=zJrvq95H");
preg_match_all("#<b>([0-9]{0,})\s+<\/b>#", $data, $matches);
print_r($matches[1]);
Result:
Array
(
[0] => 11
[1] => 11
[2] => 11
[3] => 11
)