I'm writing a PHP script that fetches website content and inserts each web page as a separate post in my MySQL table. My method of choice is cURL, some regular expressions, preg_match_all, a couple of foreach loops, and finally a MySQL query. The code is fine, so I won't post it here.
The problem is that the script works fine when fewer than about 200 posts are inserted into the database, but when the number exceeds ~200, the browser never stops loading. I want to store exactly 749 web pages (the content of each page is plain text). When I press stop and look at the database, I have around 2,000 posts in it, with roughly 5 duplicates of each post.
So the conclusion I draw is that something (the browser, the server, the database?) can't handle that many pages, aborts the process, and restarts it. I have tried increasing the max execution time, both in PHP and for cURL, but with the same result.
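For reference, this is roughly how I raised the limits (the exact values are just ones I experimented with):

set_time_limit(300);                          // PHP script time limit, in seconds
ini_set('max_execution_time', 300);           // the same limit via ini_set
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 60); // cURL connect timeout
curl_setopt($ch, CURLOPT_TIMEOUT, 300);       // cURL total request timeout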
Here's a previous post where I had some problems earlier in the process: Content from large number of web pages into array (PHP)
My question is simply: Does anyone have any idea what's going wrong here?
EDIT: OK then, since it has been requested, here's the code:
EDIT2: After some trial and error, I've discovered that the magic number is 152. The script can store the first 151 pages, but when I raise the number of pages to 152, the number of posts in my database table suddenly doubles to 304. Any ideas?
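One guard I'm considering, just to make it visible whether the same rows are written twice: a sketch only, and it assumes a UNIQUE key on the table (e.g. ALTER TABLE riksdag ADD UNIQUE KEY uniq_speech (title(100), date), assuming title is a string column), which I don't have yet.

// Sketch: with a UNIQUE key in place, a re-run would silently skip
// duplicates instead of inserting them again (same variables as in the
// insert loop in the code below)
$query = "INSERT IGNORE INTO riksdag (title, speaker, contents, date) VALUES('$utf_title', '$utf_speaker', '$utf_contents', '$date')";
mysql_query($query);

It wouldn't explain the doubling, but it should show whether rows are being re-inserted.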
EDIT3 (good one): It turns out the script actually does what it's supposed to on my local server. The problem only appears when I run it on my web host's server.
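To compare the two environments, I'm going to dump the limits I can think of on both servers (a quick check only; there may be other relevant settings, and this snippet is separate from the main script below):

echo ini_get('max_execution_time') . "\n"; // script time limit (seconds)
echo ini_get('memory_limit') . "\n";       // memory ceiling
echo ini_get('max_input_time') . "\n";     // input parsing limit (seconds)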
<?php
header('content-type: text/html; charset=utf-8');
// Initialize cURL and fetch the speech list as XML
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://data.riksdagen.se/anforandelista/?anflista=true&rm=&anftyp=Nej&d=&ts=&parti=sd&iid=&sz=1000&utformat=xml');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$contents = curl_exec($ch);
// Pull out the fields we need with regexes
$regex  = '/<anforande_url_xml>(.*?)<\/anforande_url_xml>/';  // URL of each speech
$regex1 = '/<avsnittsrubrik>(.*?)<\/avsnittsrubrik>/';        // section title
$regex2 = '/<anforande_url_xml>http:\/\/data\.riksdagen\.se\/anforande\/(.*?)<\/anforande_url_xml>/'; // speech id
$regex3 = '/<dok_datum>(.*?)<\/dok_datum>/';                  // document date
$regex4 = '/<talare>(.*?)<\/talare>/';                        // speaker
preg_match_all($regex,  $contents, $link);
preg_match_all($regex1, $contents, $rubrik);
preg_match_all($regex2, $contents, $id);
preg_match_all($regex4, $contents, $talare);
preg_match_all($regex3, $contents, $datum);
// Display a list of all posts
$j = 0;
echo "<pre>";
foreach ($link[1] as $row) {
    // [1] is the capture group, i.e. the title without the XML tags
    echo $j . " <a href=\"display.php?id=" . $id[1][$j] . "\">" . $rubrik[1][$j] . "</a>" . "<br />";
    $j++;
}
// Raise the time limit before the long fetch loop below
ini_set('max_execution_time', 300);
// (Leftover no-op: this just appends the same URLs to $link again; nothing below uses them)
foreach ($link[1] as $row) {
    $link[] = $row;
}
// Fetch each speech page and collect the core content
$lines = array();
$regex = '/<anforandetext>(.*?)<\/anforandetext>/s';
foreach ($link[1] as $row) {
    $contents = file_get_contents($row);
    preg_match_all($regex, $contents, $output);
    if (!empty($output[1])) {
        $lines[] = $output[1];
    }
}
// Connect. Yes, I know the mysql_* functions are deprecated. Later issue.
mysql_connect("host", "user", "pass") or die("Could not connect.");
mysql_select_db("db");
// Insert each post into the database
$h = 0;
foreach ($lines as $row) {
    // Escape the values so quotes in the text don't break the query
    $utf_title    = mysql_real_escape_string(utf8_decode($rubrik[1][$h]));
    $utf_speaker  = mysql_real_escape_string(utf8_decode($talare[1][$h]));
    $utf_contents = mysql_real_escape_string(utf8_decode($row[0]));
    $date         = mysql_real_escape_string(utf8_decode($datum[1][$h]));
    $query = "INSERT INTO riksdag (title, speaker, contents, date) VALUES('$utf_title', '$utf_speaker', '$utf_contents', '$date')";
    mysql_query($query);
    $h++;
}
echo "</pre>";
curl_close($ch);
?>
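Since the script behaves on my local server but not on the host, my next step is to log progress with timestamps so a restart becomes visible (a sketch; the log path is just an example):

// Sketch: append a timestamped marker per inserted page. If the host
// restarts the script, the counter jumps back and a new process id
// shows up in the log.
function log_progress($i) {
    file_put_contents('/tmp/scrape.log',
        date('c') . ' page ' . $i . ' (pid ' . getmypid() . ")\n",
        FILE_APPEND);
}

Calling log_progress($h) inside the insert loop should show whether the ~150-row batches come from one long run or from several restarted ones.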