So I have this function for making non-blocking curl requests. It works fine on what I've tested so far (small numbers of requests), but I need it to scale up to thousands of requests (maybe 10,000 max). My concern is running into problems with too many parallel requests in flight at once.
What would you suggest to rate-limit the requests? usleep()? Requests in batches? The function is below:
function poly_curl($requests){
    $queue = curl_multi_init();
    $curl_array = array();
    $count = 0;
    foreach($requests as $request)
    {
        $curl_array[$count] = curl_init($request);
        curl_setopt($curl_array[$count], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($queue, $curl_array[$count]);
        $count++;
    }
    $running = NULL;
    do {
        curl_multi_exec($queue, $running);
    } while($running > 0);
    $res = array();
    $count = 0;
    foreach($requests as $request)
    {
        $res[$count] = curl_multi_getcontent($curl_array[$count]);
        $count++;
    }
    $count = 0;
    foreach($requests as $request){
        curl_multi_remove_handle($queue, $curl_array[$count]);
        $count++;
    }
    curl_multi_close($queue);
    return $res;
}
You may want to implement something called Exponential Backoff (see Wikipedia).
Basically, it is an algorithm that lets you dynamically scale your request rate depending on some feedback.
You define a rate in your application, and on the first time-out, error, or whatever condition you decide on, you decrease that rate until requests start succeeding again.
You can implement it easily using the HTTP response code, for example.
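A minimal per-request sketch of the idea, assuming plain curl and a simple doubling delay (the function name, retry count and 60-second cap are only illustrative, not anything prescribed):

// Minimal exponential-backoff sketch. On an error or a 4xx/5xx response
// the delay doubles (up to a cap) before retrying; on success it stops,
// so the effective request rate adapts to the feedback the server gives.
function fetch_with_backoff($url, $maxRetries = 5)
{
    $delay = 1; // seconds; initial back-off
    for ($attempt = 0; $attempt <= $maxRetries; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($body !== false && $code < 400) {
            return $body;             // success, no back-off needed
        }
        sleep($delay);                // wait before retrying
        $delay = min($delay * 2, 60); // double the delay, cap at 60 s
    }
    return null;                      // gave up after $maxRetries retries
}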
I think curl_multi_exec is bad for this purpose, because even if you work in batches of 100, 99 requests could be finished and would still have to wait for the last one to complete.
But you need 100 parallel requests, and when one finishes another should be started immediately, so you cannot really use curl_multi_exec at all.
I would use a normal producer-consumer setup with a constant number of consumers, each consumer processing only one URL at a time. For example php-resque, started with COUNT=100 php resque.php.
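A rough sketch of what that could look like with php-resque (the class name, queue name and what you do with the response are made up for the example; php-resque itself provides Resque::enqueue() and runs perform() inside the workers):

// Illustrative php-resque job: each worker (consumer) fetches one URL per
// job, so running COUNT=100 workers gives roughly 100 parallel requests.
// Assumes php-resque is installed and autoloaded (e.g. via Composer).
class FetchUrlJob
{
    public function perform()
    {
        $url = $this->args['url'];
        $ch  = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        curl_close($ch);
        // store or post-process $body here
    }
}

// Producer side: enqueue one job per URL.
$urls = array('http://example.com/1', 'http://example.com/2' /* ... */);
foreach ($urls as $url) {
    Resque::enqueue('fetch', 'FetchUrlJob', array('url' => $url));
}
// Consumers: QUEUE=fetch COUNT=100 php resque.php (as above)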
Last time I was doing something like this it involved downloading and "parsing" files. I was only able to process 4 subpages at a time, limited by a very weak processor (2 cores with HT). I ended up with two queues: one for waiting, one for in-process. Every time a task left the second queue, a new one was taken from the first.
It may sound complicated, but it ended up as two loops inside each other and a simple count().
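For illustration, that two-queue bookkeeping could look roughly like this; here each in-process task is a child process running the curl command line, but that part is just a stand-in (the $max value, the URLs and the use of proc_open are all made up for the sketch):

// Two queues: $waiting holds URLs not yet started, $inProcess holds running
// tasks, capped at $max. Whenever a task finishes, a new one is started.
$waiting   = array('http://example.com/a', 'http://example.com/b' /* ... */);
$max       = 4;        // in-process limit
$inProcess = array();  // url => process handle

// Discard child output; Unix-style paths for brevity.
$spec = array(1 => array('file', '/dev/null', 'w'),
              2 => array('file', '/dev/null', 'w'));

while ($waiting || $inProcess) {
    // Fill free slots from the waiting queue.
    while ($waiting && count($inProcess) < $max) {
        $url  = array_shift($waiting);
        $proc = proc_open('curl -s -o /dev/null ' . escapeshellarg($url), $spec, $pipes);
        if ($proc !== false) {
            $inProcess[$url] = $proc;
        }
    }
    // Reap finished tasks, freeing slots for the next iteration.
    foreach ($inProcess as $url => $proc) {
        $status = proc_get_status($proc);
        if (!$status['running']) {
            proc_close($proc);
            unset($inProcess[$url]);
        }
    }
    usleep(100000); // poll every 100 ms
}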
Btw, considering such a high rate I would think about using Node.js, for simplicity, or anything more non-blocking and better suited for daemons than PHP. As long as threads are a PHP weak point, it just doesn't fit here.
PS: nice & useful bit of code, thanks.
We used to face the same problem with C++ connection pooling code. The approach in those days involved some serious analysis.
But the essence was that we created a pool, and requests would get processed depending on the number of available connections. We also assigned a maximum number of connection pools (this was determined by testing).
What you really need is a way to determine how many requests are being processed and to put a limit on it. In your case that is $count.
Just compare $count to a maximum value (say, $max) and stop there. Define the value depending on the system the program runs on; $max could be hardcoded or dynamic.
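A minimal sketch of that cap, assuming you keep your poly_curl() as-is and simply feed it chunks of at most $max requests (the wrapper name and the default of 100 are only illustrative):

// Split the request list into chunks of at most $max and run poly_curl()
// on each chunk, so no more than $max handles are in flight at once.
function poly_curl_limited($requests, $max = 100)
{
    $results = array();
    foreach (array_chunk($requests, $max) as $batch) {
        $results = array_merge($results, poly_curl($batch));
    }
    return $results;
}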