The spider code in my Scrapy project is as follows:
import scrapy
from bokeproject.items import BokeprojectItem
from scrapy.http import Request


class HexunspiderSpider(scrapy.Spider):
    name = 'hexunspider'
    allowed_domains = ['hexun.com']
    start_urls = ['http://27525283.blog.hexun.com/p1/default.html']
    # http://27525283.blog.hexun.com/
    # http://27525283.blog.hexun.com/p2/default.html
    print(start_urls)

    def parse(self, response):
        item = BokeprojectItem()
        item['name'] = response.xpath('//div[@class="ArticleTitle"]/span/a/text()').extract()
        item['url'] = response.xpath('//div[@class="ArticleTitle"]/span/a/@href').extract()
        item['hits'] = response.xpath('//div[@class="ArticleInfo"]/span/text()').extract()
        item['comment'] = response.xpath('//div[@class="ArticleInfo"]/a/span/text()').extract()
        print(item)
        yield item
        for j in range(2, 10):
            nexturl = 'http://27525283.blog.hexun.com/p' + str(j) + '/default.html'
            print(nexturl)
            yield Request(nexturl, callback=self.parse)
I have also set the following in settings.py:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
But the parse function never runs. Can anyone tell me what is going on here? Did I miss some setting?
The server is refusing your requests because no headers are set, so the responses never reach parse. The fix is to (1) add browser-like headers and (2) issue the requests yourself with scrapy.Request. Add the following to your spider and it should run and return results:
    # Build the full page list up front
    start_urls = ['http://27525283.blog.hexun.com/p' + str(j) + '/default.html' for j in range(1, 6)]
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    }
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'DOWNLOAD_DELAY': 1
    }

    def start_requests(self):
        # Issue one request per start URL, attaching the browser-like headers
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers,
                callback=self.parse
            )
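If you would rather not attach headers on every request, the same user-agent can instead be applied globally in settings.py. This is only a sketch of the equivalent configuration; the extra header values below are examples, not something your site specifically requires:

# settings.py -- apply the browser user-agent to all requests globally
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'

# Or send a fuller set of default headers with every request (example values)
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}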
Also, in the parse function further down, remove the final for loop; the pagination is now handled by the start_urls list.
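For reference, putting all of the above together, the revised spider could look roughly like this (a sketch that assumes your existing BokeprojectItem fields and pages 1 to 5):

import scrapy
from bokeproject.items import BokeprojectItem


class HexunspiderSpider(scrapy.Spider):
    name = 'hexunspider'
    allowed_domains = ['hexun.com']
    # Build the full page list up front instead of yielding follow-up requests in parse
    start_urls = ['http://27525283.blog.hexun.com/p' + str(j) + '/default.html' for j in range(1, 6)]
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    }
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'DOWNLOAD_DELAY': 1,
    }

    def start_requests(self):
        # Attach the browser-like headers so the server does not reject the requests
        for url in self.start_urls:
            yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # Extract one item per page; the for loop that generated follow-up pages is gone
        item = BokeprojectItem()
        item['name'] = response.xpath('//div[@class="ArticleTitle"]/span/a/text()').extract()
        item['url'] = response.xpath('//div[@class="ArticleTitle"]/span/a/@href').extract()
        item['hits'] = response.xpath('//div[@class="ArticleInfo"]/span/text()').extract()
        item['comment'] = response.xpath('//div[@class="ArticleInfo"]/a/span/text()').extract()
        yield item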
If this helped, please accept the answer.