[Scrapy crawler question] The spider's parse function never executes; looking for help

The spider code in my Scrapy project is as follows:

import scrapy
from bokeproject.items import BokeprojectItem
from scrapy.http import Request

class HexunspiderSpider(scrapy.Spider):
    name = 'hexunspider'
    allowed_domains = ['hexun.com']
    start_urls = ['http://27525283.blog.hexun.com/p1/default.html']
    # http://27525283.blog.hexun.com/
    # http://27525283.blog.hexun.com/p2/default.html
    print(start_urls)

    def parse(self, response):
        # Extract the article list fields from the current page
        item = BokeprojectItem()
        item['name'] = response.xpath('//div[@class="ArticleTitle"]/span/a/text()').extract()
        item['url'] = response.xpath('//div[@class="ArticleTitle"]/span/a/@href').extract()
        item['hits'] = response.xpath('//div[@class="ArticleInfo"]/span/text()').extract()
        item['comment'] = response.xpath('//div[@class="ArticleInfo"]/a/span/text()').extract()
        print(item)
        yield item
        # Queue listing pages p2 through p9
        for j in range(2, 10):
            nexturl = 'http://27525283.blog.hexun.com/p' + str(j) + '/default.html'
            print(nexturl)
            yield Request(nexturl, callback=self.parse)

I have also set this in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False


The parse function never executes. Can anyone tell me what is going on? Is something not set up correctly?

The server is refusing your requests because no headers are set. The fix is twofold: add browser-like headers, and issue the requests yourself with Scrapy's Request in start_requests. Add the following code to your spider and it will run and return results:

    start_urls = ['http://27525283.blog.hexun.com/p' + str(j) + '/default.html' for j in range(1, 6)]
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    }
    custom_settings = {
        # Throttle the crawl so the server is less likely to reject requests
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
        'DOWNLOAD_DELAY': 1
    }

    def start_requests(self):
        # Issue one request per listing page, sending the browser User-Agent
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers,
                callback=self.parse
            )

Then remove the trailing for loop from the parse method: start_requests already generates all the listing-page URLs, so parse only needs to extract and yield the item.
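For reference, here is a minimal sketch of the trimmed parse method; the XPath extraction is unchanged from the question, only the pagination loop is removed:

    def parse(self, response):
        # Extraction only; pagination is now handled by start_requests
        item = BokeprojectItem()
        item['name'] = response.xpath('//div[@class="ArticleTitle"]/span/a/text()').extract()
        item['url'] = response.xpath('//div[@class="ArticleTitle"]/span/a/@href').extract()
        item['hits'] = response.xpath('//div[@class="ArticleInfo"]/span/text()').extract()
        item['comment'] = response.xpath('//div[@class="ArticleInfo"]/a/span/text()').extract()
        yield item

With DOWNLOAD_DELAY and the per-domain concurrency limit in custom_settings, the crawl stays slow enough that the server should keep answering.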

If this helped, please accept the answer.