The goal is to crawl everything from http://download.kaoyan.com/list-1 through http://download.kaoyan.com/list-1500. Each listing also has its own pagination, from list-1p1 up to list-1p20. At the moment the spider only crawls the list-1p pages; how should the code be changed so it moves on to list-6? (list-2 returns a 404.)
# -*- coding: utf-8 -*-
import scrapy

from Kaoyan.items import KaoyanItem


class KaoyanbangSpider(scrapy.Spider):
    name = 'Kaoyanbang'
    allowed_domains = ['kaoyan.com']
    baseurl = 'http://download.kaoyan.com/list-'
    linkuseurl = 'http://download.kaoyan.com'
    offset = 1   # listing counter, shared by every callback
    pset = 1     # sub-page counter, shared by every callback
    start_urls = [baseurl + str(offset) + 'p' + str(pset)]
    handle_httpstatus_list = [404, 500]

    def parse(self, response):
        # Extract the name and link of every download entry on the page.
        for node in response.xpath('//table/tr/th/span/a'):
            item = KaoyanItem()
            item['name'] = node.xpath('./text()').extract()[0].encode('utf-8')
            item['link'] = (self.linkuseurl + node.xpath('./@href').extract()[0]).encode('utf-8')
            yield item

        # Queue follow-up pages by advancing the shared counters.
        while self.offset < 1500:
            while self.pset < 50:
                self.pset += 1
                url = self.baseurl + str(self.offset) + 'p' + str(self.pset)
                yield scrapy.Request(url, callback=self.parse)
            self.offset += 5
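In case this is still open: the shared self.offset / self.pset counters are advanced inside parse by every response, so each callback re-runs the same loops and the spider never walks the listings in order. One way to restructure it is to issue one request per listing from start_requests, carry the listing number and page number in the request meta, and use handle_httpstatus_list plus a status check to skip listings that return 404 (such as list-2) so the crawl simply continues with the next one. Below is a minimal sketch under those assumptions, capping each listing at 20 sub-pages as the question describes (MAX_PAGE is an assumed limit, not something from the site):

# -*- coding: utf-8 -*-
# Sketch only: restructures the pagination from the question above.
import scrapy

from Kaoyan.items import KaoyanItem


class KaoyanbangSpider(scrapy.Spider):
    name = 'Kaoyanbang'
    allowed_domains = ['kaoyan.com']
    baseurl = 'http://download.kaoyan.com/list-'
    linkuseurl = 'http://download.kaoyan.com'
    handle_httpstatus_list = [404, 500]  # let 404/500 responses reach parse instead of being dropped
    MAX_PAGE = 20  # assumption: each listing paginates from list-Np1 to list-Np20

    def start_requests(self):
        # One request per listing; each listing's own pagination is queued in parse.
        for offset in range(1, 1501):
            url = '%s%dp1' % (self.baseurl, offset)
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'offset': offset, 'page': 1})

    def parse(self, response):
        # Skip listings that do not exist (e.g. list-2 returns 404); the
        # requests for the other listings are unaffected.
        if response.status in (404, 500):
            return

        for node in response.xpath('//table/tr/th/span/a'):
            item = KaoyanItem()
            item['name'] = node.xpath('./text()').extract_first()
            item['link'] = self.linkuseurl + node.xpath('./@href').extract_first()
            yield item

        # Queue the next sub-page of the same listing, up to MAX_PAGE.
        offset = response.meta['offset']
        page = response.meta['page']
        if page < self.MAX_PAGE:
            url = '%s%dp%d' % (self.baseurl, offset, page + 1)
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'offset': offset, 'page': page + 1})

Driving the pagination from start_requests and the request meta (or, alternatively, following an explicit "next page" link inside parse) keeps each listing's pages independent, so a 404 on one listing never blocks the crawl of the others.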