I wrote a PHP script that makes an HTTP POST request using curl. Here is the code:
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIE, "cookie=cookie");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$response = curl_exec($ch);
// this point
extr($response, $param_1, $param_2);
The problem is that the response is sometimes larger than 1 GB, so the script blocks until the full response has been received (marked in the code as // this point). And if malformed HTML is received, PHP generates an error, so the whole thing has to start over from the beginning.
Here are the rest of the functions:
function extr($string, $a, $b)
{
    $doc = new DOMDocument;
    @$doc->loadHTML($string);
    $table = $doc->getElementById('myTableId');
    if (is_object($table)) {
        foreach ($table->getElementsByTagName('tr') as $record) {
            $rec = array();
            foreach ($record->getElementsByTagName('td') as $data) {
                $rec[] = $data->nodeValue;
            }
            if ($rec) {
                put_data($rec);
            }
        }
    } else {
        echo 'Skipped: Param1: ' . $a . ' -- Param2: ' . $b . '<br>';
    }
}
function put_data($one = array())
{
    $one = json_encode($one) . "\n";
    file_put_contents("data.json", $one, FILE_APPEND);
}
ini_set('max_execution_time', 3000000);
ini_set('memory_limit', '-1');
The alternatives I can think of are to process the data as it is received, if that is possible with curl, or to continue the previous curl request from where it left off.
Is there any possible workaround for this?
Do I need to switch to a language other than PHP for this?
You can process the data in chunks as they arrive using the CURLOPT_WRITEFUNCTION option with a callback:
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function($ch, $data) {
    echo "\nchunk received:\n", $data; // process your chunk here
    return strlen($data); // returning anything other than the number of bytes received aborts the transfer
});
As was already mentioned in the comments, though, if your response content is HTML that you're loading into DOMDocument, you'll need the full data first anyway.
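If the main concern is memory rather than parsing on the fly, another option is to stream the response straight to disk with CURLOPT_FILE and parse the file afterwards. A minimal sketch, assuming $url and $post_string as in the question (the function name and file path are made up for this example):

```php
<?php
// Sketch: stream a large POST response straight to disk, so the >1 GB
// body never has to fit into a PHP string.
function fetch_to_file($url, $post_string, $path)
{
    $fp = fopen($path, 'w');             // body is written here chunk by chunk
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_FILE, $fp); // write body to $fp instead of returning it
    $ok = curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    return $ok !== false;
}

// Usage: DOMDocument can then read from disk, though building the DOM
// tree for a huge file will still need a lot of memory.
// if (fetch_to_file($url, $post_string, 'response.html')) {
//     $doc = new DOMDocument;
//     @$doc->loadHTMLFile('response.html');
// }
```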
You can do two things:
a) Use a SAX parser. A SAX parser is like a DOM parser, but it can handle streaming input, whereas a DOM parser needs the whole document or it will throw errors. The SAX parser just feeds you events to process.
What is the difference between SAX and DOM?
b) When using the SAX parser, pass it data incrementally using CURLOPT_WRITEFUNCTION. Just saw that lafor also posted this, so upvoting that.
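Putting the two together: PHP's built-in expat-based parser (xml_parser_create) works incrementally, so each chunk from the CURLOPT_WRITEFUNCTION callback can be fed to it as it arrives. A sketch of the idea, using hard-coded chunks in place of the curl callback; note that expat expects well-formed XML/XHTML, so real-world HTML may need cleaning up (e.g. with the tidy extension) first:

```php
<?php
// Sketch: incremental (SAX-style) extraction of <td> cell text.
// In the real script you would collect a row and call put_data() on
// each closing <tr> instead of appending to $cells.
$cells  = [];
$inCell = false;
$buffer = '';

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$inCell, &$buffer) {
        if ($name === 'td') { $inCell = true; $buffer = ''; }
    },
    function ($p, $name) use (&$inCell, &$buffer, &$cells) {
        if ($name === 'td') { $cells[] = $buffer; $inCell = false; }
    }
);
xml_set_character_data_handler($parser, function ($p, $data) use (&$inCell, &$buffer) {
    if ($inCell) { $buffer .= $data; }
});

// Feed the data piece by piece -- this is what you would do inside the
// CURLOPT_WRITEFUNCTION callback. Note the tag can even be split mid-text.
foreach (['<table><tr><td>fo', 'o</td><td>bar</td></tr></table>'] as $chunk) {
    xml_parse($parser, $chunk, false);
}
xml_parse($parser, '', true); // signal end of input
xml_parser_free($parser);

// $cells is now ['foo', 'bar']
```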