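My PHP page scraper stops partway through. What should I do? And if I want to resume from where it stopped next time, how do I do that?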
Add set_time_limit(0); with that in place the loop can run to completion.
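Save the address of each page you have traversed to a database or a file. On the next run, use the saved value as the starting point of the loop.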
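The file-based variant of this advice could be sketched roughly as follows. This is only a minimal sketch: the checkpoint path and the helper names loadCheckpoint, saveCheckpoint and clearCheckpoint are made up for illustration, and the default value 'GEN.1' is taken from the script in the question.

```php
<?php
// Hypothetical checkpoint file; the path is an assumption, pick any writable location.
$checkpointFile = sys_get_temp_dir() . '/scrape_checkpoint.txt';

// Return the saved chapter id, or $default if no usable checkpoint exists yet.
function loadCheckpoint($path, $default)
{
    if (is_file($path)) {
        $saved = trim((string) file_get_contents($path));
        if ($saved !== '') {
            return $saved;
        }
    }
    return $default;
}

// Record the chapter that should be fetched next.
function saveCheckpoint($path, $nextUrl)
{
    // LOCK_EX takes an exclusive lock while writing, in case two runs overlap.
    file_put_contents($path, $nextUrl, LOCK_EX);
}

// Remove the checkpoint once the whole crawl has finished.
function clearCheckpoint($path)
{
    if (is_file($path)) {
        unlink($path);
    }
}

// Start from the saved chapter if there is one, otherwise from the beginning.
$nextUrl = loadCheckpoint($checkpointFile, 'GEN.1');

// Inside the crawl loop (fetching and parsing omitted here):
// while (!empty($nextUrl)) {
//     ... fetch and process $nextUrl, read the following chapter id into $following ...
//     saveCheckpoint($checkpointFile, $following);
//     $nextUrl = $following;
// }
// clearCheckpoint($checkpointFile);
```

Saving the checkpoint only after a chapter has been fully processed means an interrupted run repeats at most one chapter when it is restarted.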
#!/usr/bin/php
#--*-- coding: utf8 --*--
<?php
set_time_limit(0);
error_reporting(E_ALL^E_NOTICE);
$nextUrl = "GEN.1";
while(!empty($nextUrl)){
$userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13';
$ch= curl_init();
curl_setopt($ch, CURLOPT_URL,"https://wdbible.com/api/bible/chapterhtml/cunps/{$nextUrl}");
curl_setopt($ch, CURLOPT_HEADER,0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
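$data = curl_exec($ch);
// An empty body or a response without data.content counts as a failure.
$json = empty($data) ? null : json_decode($data,true);
// Retry a limited number of times instead of continuing with undecoded data.
for($try = 1; $try <= 3 && !isset($json['data']['content']); $try++){
echo "Request for chapter {$nextUrl} failed, retrying...\r\n";
sleep(1);
$data = curl_exec($ch);
$json = empty($data) ? null : json_decode($data,true);
}
if(!isset($json['data']['content'])){
die("Giving up on chapter {$nextUrl}; resume from this chapter on the next run.\r\n");
}
$data = $json;
$content = $data['data']['content'];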
$file1 = "D:/cmd/xml/{$nextUrl}.xml";
file_put_contents($file1,$content);
echo "{$nextUrl}.xml generated successfully.\r\n";
$file = "D:/cmd/txt/{$nextUrl}.txt";
$stack = array();
$top = -1;
$xmlParser = xml_parser_create();
xml_set_element_handler($xmlParser,"Start","Stop");
xml_set_character_data_handler($xmlParser,"char");
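$fp = fopen($file1,"r");
while($row = fread($fp,10000)){
// xml_error_string() takes only the error code, so the line number is appended separately.
xml_parse($xmlParser,$row) or
die(xml_error_string(xml_get_error_code($xmlParser)) .
" at line " . xml_get_current_line_number($xmlParser));
}
fclose($fp);
xml_parser_free($xmlParser);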
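echo "Chapter {$nextUrl} scraped successfully...\r\n";
$current = $nextUrl;
$nextUrl = isset($data['data']['nextChapterUsfm']) ? $data['data']['nextChapterUsfm'] : '';
if(!empty($nextUrl)){
echo "Reading the next chapter...\r\n";
}else{
// Re-reading the same decoded response cannot yield a new value, so the loop ends here.
echo "No next chapter path returned after {$current}; stopping.\r\n";
}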
}
echo "Scraping finished...\r\n";
function Start($parser, $element_name, $element_attr){
global $top,$stack;
if($element_name == "DIV" && count($element_attr) == 1){
$top++;
array_push($stack,$element_name);
$top++;
array_push($stack,$element_attr);
}else{
$top++;
array_push($stack,$element_name);
}
}
function Stop($parser, $element_name){
global $top,$stack,$file;
switch($element_name){
case "H6" :
file_put_contents($file,"\r\n",FILE_APPEND);
array_pop($stack);
$top--;
array_pop($stack);
$top--;
break;
case "H5" :
file_put_contents($file,"\r\n",FILE_APPEND);
array_pop($stack);
$top--;
array_pop($stack);
$top--;
break;
case "MARK" :
array_pop($stack);
$top--;
break;
case "SPAN" :
array_pop($stack);
$top--;
break;
// The XML parser upper-cases element names by default, and constant names are case-sensitive.
case "LI" :
file_put_contents($file,"\r\n",FILE_APPEND);
array_pop($stack);
$top--;
break;
case "DIV" :
if($stack[$top] == "DIV"){
array_pop($stack);
$top--;
}else{
file_put_contents($file,"\r\n",FILE_APPEND);
array_pop($stack);
$top--;
array_pop($stack);
$top--;
}
break;
case "P" :
array_pop($stack);
$top--;
}
}
function char($parser, $data1){
global $top,$stack,$file;
if (strlen(trim($data1)) > 0){
file_put_contents($file,$data1,FILE_APPEND);
}
}
?>
Partway through the run, the next URL always ends up unreachable..
There is a warning like this..