The movie source below no longer works. For example, I want to scrape this title from this site: https://www.kanjuwang.net/detail/?241517.html. How do I need to fill in the following code?
<?php
header("Content-type:text/html;charset=utf-8");
include '../inc/config.php';
$managername = $_COOKIE["managername"];
if(!$managername){
echo "<script>alert('请登陆!'); parent.window.location.href = 'login.php';</script>";
}else{
$sql = "SELECT * FROM `" . $mysql_pre_name . "manager` WHERE `m_name` LIKE '" . $managername . "' LIMIT 0, 1 ";
$check_query = mysql_query($sql);
$result = mysql_fetch_array($check_query);
$managerlevel = $result['m_level'];
setcookie("managername", $managername, time() + 3600);
if($managerlevel>1){
echo "<script>alert('您的管理权限为".$managerlevel.",无权进行此操作!'); parent.window.location.href = './';</script>";
exit;
}
}
?>
<input onclick="window.location.href='?page=<?=$_GET['page']?>&num=<?=$_GET['num']?>&exit=1'" type="submit" value="停止采集" />
<input onclick="window.location.href='?page=<?=$_GET['page']?>&num=<?=$_GET['num']?>'" type="submit" value="开始采集" />
<input onclick="window.location.href='<?='http://' . $_SERVER['SERVER_NAME'] . $_SERVER["SCRIPT_NAME"];?>'" type="submit" value="从头采集" />
<br /><br />
<?php
if(isset($_GET['exit']) && $_GET['exit']==1){ exit; }
if(!$_GET['page']){
$page = 1;
}else{
$page = $_GET['page'];
}
if(!$_GET['num']){
$num = 0;
}else{
$num = $_GET['num'];
}
if($_GET['page'] > 4){
echo '采集完成';exit;
}
$type = 10;
$yurl = "http://www.iqiyi.com/lib/dianying/,,_11_".$page.".html";
$purl = file_get_contents($yurl);
preg_match_all('/class="site-piclist_info_title">(.*)<\/p>/imsU',$purl,$href1);
foreach($href1[1] as $kh => $vh){
preg_match_all('/http:\/\/www.iqiyi.com\/lib\/m_(.*).html/imsU',$vh,$href);
foreach($href[1] as $khr){
$urlds[] = $khr;
}
}
$nums = $urlds[$num];
$urld = file_get_contents("http://www.iqiyi.com/lib/m_".$nums.".html");
preg_match('/data-doc-id="(.*)"/imsU',$urld,$tid);
preg_match('/片名:(.*);/imsU',$urld,$titles);
preg_match('/主演:(.*);/imsU',$urld,$starrings);
preg_match('/导演:(.*);/imsU',$urld,$directeds);
preg_match('/<div class="look_point">(.*)<\/div>/imsU',$urld,$tags);
preg_match_all('/>(.*)<\/a>/imsU',$tags[1],$tagss);
foreach($tagss[1] as $kat){
$tst = str_ireplace(",","",$kat);
$tst = str_ireplace("\n","",$tst);
$tst = str_ireplace("\r","",$tst);
$str.= $tst.',';
$tag = rtrim($str, ",");
}
preg_match('/data-movlbshowmore-ele="whole">(.*)<\/p>/imsU',$urld,$contents);
preg_match('/<div class="result_pic">(.*)<\/div>/imsU',$urld,$pic1);
preg_match('/<img(.*)src="(.*)"/imsU',$pic1[1],$pic2);
$domin = file_get_contents('http://search.video.iqiyi.com/m?if=video_library&video_library_type=play_source&platform=1&key='.$tid[1]);
$json_domin = json_decode($domin,true);
$info = $json_domin['video_info'];
$from = $json_domin['site'];
$content = $contents[1];
foreach($info as $karr){
// $title = $titles[1]; // title
$title = $karr['title']; // title
$pic = $pic2[2]; // poster image
$starring = str_ireplace("、",",",$starrings[1]); // lead actors
$directed = $directeds[1]; // director
$play_url = current(explode('?',$karr['play_url'])); // strip the query string from the play URL
$playurl .= $title.'$'.$play_url."#"; // append to the combined play-url string
}
$group = "
标题:$title<br/>
标签:$tag<br/>
图片:$pic<br/>
主演:$starring<br/>
导演:$directed<br/>
来源:$from<br/>
播放地址:$playurl<br/><br/>
";
$sql = "SELECT * FROM `".$mysql_pre_name."vod` WHERE `d_name` LIKE '$title' AND `d_type` = ".$type;
$query = mysql_query($sql);
$row = mysql_fetch_array($query);
$did = $row['d_id'];
$d_name = $row['d_name'];
if(!$d_name){
$mysql = true;
}else{
$mysql = false;
}
if($mysql){
if($play_url==''||$title==''||$from==''||$pic==''){
echo $group.'电影《'.$title.'》播放地址为空,入库失败,3秒后继续<script>window.setTimeout("window.location=\'?page='.$page.'&num='.($num+1).'\'",1000); </script>';
if($_GET['num'] > 28){
echo '<script>window.setTimeout("window.location=\'?page='.($page+1).'&num=0\'",1000); </script>';
}
exit;
}
$sqlrk = "INSERT INTO `".$mysql_pre_name."vod` (`d_name`, `d_pic`, `d_picthumb`, `d_picslide`, `d_writer`, `d_starring`, `d_directed`, `d_tag`, `d_remarks`, `d_type`, `d_level`, `d_usergroup`, `d_addtime`, `d_content`, `d_playfrom`, `d_playurl`, `d_reading`) VALUES ('$title', '$pic', '', '', '$managername', '$starring', '$directed', '$tag', '', '$type', '0', '0', NOW(), '$content', '$from', '$playurl', '152');";
$result = mysql_query($sqlrk);
if ($result) {
echo $group.'电影《'.$title.'》采集入库成功,3秒后继续<script>window.setTimeout("window.location=\'?page='.$page.'&num='.($num+1).'\'",1000); </script>';
}else{
echo $group.'电影《'.$title.'》采集出现错误,入库失败,3秒后继续<script>window.setTimeout("window.location=\'?page='.$page.'&num='.($num+1).'\'",1000); </script>';
}
}else{
$sqlrk = "UPDATE `$mysql_database`.`".$mysql_pre_name."vod` SET `d_playurl` = '$playurl', `d_addtime` = NOW() WHERE `".$mysql_pre_name."vod`.`d_id` = $did;";
$result = mysql_query($sqlrk);
if ($result){
echo $group.'电影《'.$title.'》已存在,无需采集,直接覆盖播放地址,3秒后继续<script>window.setTimeout("window.location=\'?page='.$page.'&num='.($num+1).'\'",1000); </script>';
}else{
echo $group.'电影《'.$title.'》采集出现错误,入库失败,3秒后继续<script>window.setTimeout("window.location=\'?page='.$page.'&num='.($num+1).'\'",1000); </script>';
}
}
if($_GET['num'] > 28){
echo '<script>window.setTimeout("window.location=\'?page='.($page+1).'&num=0\'",1000); </script>';
}
// print_r($karr);
?>
A crawler has to be written around your specific needs. In the example you posted, the target is a movie list page: the loop walks through every page and pulls out each movie's information. The URL in your question, however, is a single movie's detail page, so you fetch that page and extract the fields you want with regular expressions.
There is nothing you can change to make it work; the two sites use completely different page formats.
1. Scan all the files on disk and put them into a queue
LinkedBlockingQueue<String> fileQueue = new LinkedBlockingQueue<>(32);
for (String fileName : fileNamelist) {
    fileQueue.put(fileName); // blocks when the queue is full, which throttles the scan
}
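The snippet above assumes the file names have already been collected. A minimal sketch of the scan itself, assuming all data files sit in a single directory (the class and method names here are illustrative, not from the original code):

import java.io.File;
import java.util.concurrent.LinkedBlockingQueue;

public class FileScanner {
    // Walk one directory and enqueue the absolute path of every regular file found.
    public static void scanDirectory(String dir, LinkedBlockingQueue<String> fileQueue) throws InterruptedException {
        File[] files = new File(dir).listFiles();
        if (files == null) {
            return; // directory missing or unreadable
        }
        for (File f : files) {
            if (f.isFile()) {
                fileQueue.put(f.getAbsolutePath()); // blocks when the queue is full
            }
        }
    }
}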
2. Start multiple threads to process the files in the queue
public void handle(int threadNum) {
    // Start threadNum worker threads to work through the files in the queue.
    ExecutorService fixedThreadPool = Executors.newFixedThreadPool(threadNum);
    for (int i = 0; i < threadNum; i++) {
        if (getFileQueue() == null) {
            break;
        } else {
            fixedThreadPool.execute(new InsertThread(getFileQueue()));
        }
    }
    fixedThreadPool.shutdown();
    // Wait until every worker has finished.
    while (!fixedThreadPool.isTerminated()) {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
3. The concrete implementation of the run method in InsertThread:
public void run() {
    while (true) {
        // Take the next file from the queue; poll() returns null once the queue is empty.
        String path = fileQueue.poll();
        if (path == null) {
            // Queue drained, this thread exits.
            break;
        }
        // Read the file line by line and insert its rows into the database.
        try {
            File file = new File(path);
            InputStreamReader isr = new InputStreamReader(new FileInputStream(file), "UTF-8");
            BufferedReader br = new BufferedReader(isr);
            connection = DBUtil.getInstance().getLocalConnection("jdbc:mysql://localhost:3306", "acctbd", "root", "root");
            String sql = "replace into " + "acct_item_20c" + " (" + "ACCT_ITEM_ID,ACCT_ID,ACCT_ITEM_TYPE_ID,AMOUNT,BILL_ID,BILLING_CYCLE_ID," +
                    "CREATE_DATE,CUST_ID,FEE_CYCLE_ID,GRP_ACCT_ITEM_TYPE_NBR,HAD_INVOICE_AMOUNT,ITEM_SOURCE_ID,NO_INVOICE_AMOUNT,OFFER_INST_ID" +
                    ",ORI_ACCT_ITEM_ID,ONE_ACCT_ITEM_ID,PAY_CYCLE_ID,PAYMENT_METHOD,PRESENT_AMOUNT,PROD_INST_ID,STATUS_CD,STATUS_DATE,REGION_ID," +
                    "PARTNER_ID,BILL_XCHG_ID,PRD_ID,OFFER_ID,UNSURE_INCOME,EVENT_PRICING_STRATEGY_ID,LATN_ID,PLATFORM,card_flag)" + " values " +
                    "(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
            PreparedStatement pstmt = connection.prepareStatement(sql);
            // These two counters drive the batched commits.
            int idx = 1;
            int leftIdx = 0;
            String line = null;
            while ((line = br.readLine()) != null) {
                String[] arr = line.split(","); // field delimiter is an assumption; adjust to the real file format
                // Nullable columns: insert a real SQL NULL, not the string "null".
                if ("null".equals(arr[4]) || StringUtils.isBlank(arr[4])) {
                    pstmt.setNull(5, Types.DECIMAL);
                } else {
                    pstmt.setBigDecimal(5, new BigDecimal(arr[4]));
                }
                // ... bind the remaining parameters the same way
                pstmt.addBatch();
                // Flush to the database every N rows.
                if (idx % 10000 == 0) {
                    pstmt.executeBatch();
                    // Drop the flushed batch.
                    pstmt.clearBatch();
                    //connection.commit();
                    // Nothing left over after a full flush.
                    leftIdx = 0;
                } else {
                    leftIdx++;
                }
                idx++;
            }
            // One last check for rows left over after the final full batch.
            if (leftIdx > 0) {
                pstmt.executeBatch();
                pstmt.clearBatch();
                //connection.commit();
            }
            pstmt.close();
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
A few points here are worth watching out for:
1. The two stages can run at the same time; reading and inserting concurrently improves throughput (see the pipeline sketch after these notes).
2. With raw JDBC, inserting a "null" value as-is gives you not an empty value but the literal string "null". For every column that is allowed to be NULL, check the value explicitly; be rigorous about it:
if ("null".equals(arr[4]) || StringUtils.isBlank(arr[4])) {
    pstmt.setNull(5, Types.DECIMAL);
} else {
    pstmt.setBigDecimal(5, new BigDecimal(arr[4]));
}
3. When committing in batches, account for every case that can occur (full batches as well as the leftover rows at the end) so that performance is maximized.
4. When inserting into the table, REPLACE INTO is used when the table has an auto-increment primary key (or another unique key) that incoming rows may collide with. The IGNORE approach (INSERT IGNORE) should also work, though I have not tried it; a sketch of both follows.
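A minimal sketch of the two statements, against a made-up table demo_item with a unique key on item_id (the table, columns, and helper method are illustrative only, not part of the code above):

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UpsertDemo {
    // REPLACE INTO deletes the conflicting row (if any) and inserts the new one;
    // INSERT IGNORE keeps the existing row and silently skips the conflicting insert.
    static void upsert(Connection connection, long itemId, BigDecimal amount) throws SQLException {
        String replaceSql = "REPLACE INTO demo_item (item_id, amount) VALUES (?, ?)";
        // String ignoreSql = "INSERT IGNORE INTO demo_item (item_id, amount) VALUES (?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(replaceSql)) {
            ps.setLong(1, itemId);
            ps.setBigDecimal(2, amount);
            ps.executeUpdate(); // overwrites the row with this item_id if it already exists
        }
    }
}

Note that REPLACE INTO deletes and re-inserts on conflict, so columns not listed in the statement fall back to their defaults, whereas INSERT IGNORE leaves the existing row untouched.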
5. If you want to know how many rows were inserted successfully and how many failed with errors, you can do it like this:
// Declare an AtomicLong counter and bump it for every row that is processed:
public AtomicLong dataRecord;

while ((line = br.readLine()) != null) {
    dataRecord20C.incrementAndGet();
    // ... insert the row as shown above
}

// Create one counter per record type, hand them to the insert threads,
// and log the running totals while waiting for the pool to finish:
AtomicLong dataRecord20C = new AtomicLong();
AtomicLong dataRecord20P = new AtomicLong();
AtomicLong dataRecordBalance = new AtomicLong();
long startInsert = System.currentTimeMillis();
ExecutorService fixedThreadPool = Executors.newFixedThreadPool(insertThreadNum);
for (int i = 0; i < insertThreadNum; i++) {
    fixedThreadPool.execute(new InsertThread(fileQueue, latnId, billingCycle,
            dataRecord20C, dataRecord20P, dataRecordBalance));
}
fixedThreadPool.shutdown();
while (!fixedThreadPool.isTerminated()) {
    try {
        Thread.sleep(1000 * 60);
        logger.info("has insert 20C RECORDS [{}],20P RECORDS [{}],BALANCE RECORDS [{}]",
                dataRecord20C, dataRecord20P, dataRecordBalance);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
6. When reading data back from the database, if the volume is very large (tens of millions of rows), take great care to avoid GC problems hanging the process. With that much data the ResultSet would otherwise be huge, so put a limit on it by adding these lines:
preparedStatement = conn.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
// Add these two lines before executing the SQL
preparedStatement.setFetchSize(Integer.MIN_VALUE);
preparedStatement.setFetchDirection(ResultSet.FETCH_REVERSE);
resultSet = preparedStatement.executeQuery();
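On point 1: with the poll()-returns-null exit used in step 3, an insert thread that briefly outruns the scan would quit too early. One common way to let the scan and the inserts overlap safely is to push one sentinel ("poison pill") per consumer when the scan finishes. A minimal, self-contained sketch with made-up file names; this is a variation on the code above, not the original design:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class PipelineDemo {
    private static final String POISON = "__END__"; // sentinel telling a consumer to stop

    public static void main(String[] args) throws Exception {
        LinkedBlockingQueue<String> fileQueue = new LinkedBlockingQueue<>(32);
        int insertThreadNum = 4;

        // Consumers start first and block on take() while the scan is still running.
        ExecutorService pool = Executors.newFixedThreadPool(insertThreadNum);
        for (int i = 0; i < insertThreadNum; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String path = fileQueue.take();
                        if (POISON.equals(path)) {
                            break; // scan finished and nothing is left to do
                        }
                        System.out.println("insert file " + path); // real code would load the file here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: enqueue file paths (stand-in for the real disk scan), then one sentinel per consumer.
        for (int i = 0; i < 100; i++) {
            fileQueue.put("/data/file_" + i + ".txt");
        }
        for (int i = 0; i < insertThreadNum; i++) {
            fileQueue.put(POISON);
        }
        pool.shutdown();
    }
}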
Ok Fine ~
Below is a simplified example: $yurl has been changed and the regular expressions updated to match the new site's structure. You will need to adjust them to the actual pages:
$type = 10;
$yurl = "https://www.kanjuwang.net/detail/?241517.html";
$purl = file_get_contents($yurl);
preg_match('/<h1 class="title">(.*)<\/h1>/imsU', $purl, $titles); // extract the movie title
preg_match('/<span class="actor">主演:(.*)<\/span>/imsU', $purl, $starrings); // extract the lead actors
// ...
// add matching rules here to extract the other fields from the new site
The code you posted scrapes a different movie site, and that site's page format is completely different from the one you now want to scrape, so it is of limited use as a reference. For an ordinary website like this you can also crawl and parse the pages with a Python program; parsing can be done with the re module or BeautifulSoup. For example, to parse out the movie title:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # html is the downloaded detail-page source
title = soup.find('h1', attrs={'class': 'title text-fff'})
name = title.get_text()