Scraping bilibili video data with the Scrapy framework produces a referer:None problem

 

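When scraping data from some websites, the code needs to send headers and cookies. Besides user-agent, some sites also require referer, accept, content-type, accept-encoding and similar fields to be written into the headers. For how to add these parameters in Scrapy, see https://blog.csdn.net/weixin_44508906/article/details/87895868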
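One project-wide way to add such headers, sketched under assumptions: Scrapy reads the `USER_AGENT` and `DEFAULT_REQUEST_HEADERS` settings from the project's settings.py and applies them to requests that do not already carry the same header. The values below (the user-agent string and the bilibili home page as referer) are placeholders for illustration, not values taken from the original post.

```python
# settings.py (fragment): default headers for every request in the project.
# All values here are illustrative placeholders; adjust them to the target site.

# Sent as the User-Agent header unless a request overrides it.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Added to each request that does not set the same header explicitly.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    # Assumed referring page; some sites expect a more specific one.
    "Referer": "https://www.bilibili.com/",
}
```

Headers passed directly to an individual request take precedence over these defaults, so the setting is best seen as a fallback.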

import time
import scrapy
from bilibili_video.items import BilibiliVideoItem


class VideoSpider(scrapy.Spider):
    name = 'video'
    allowed_domains = ['bilibili.com']

    def start_requests(self):
        # temp_url = "https://www.bilibili.com/v/life/funny/?spm_id_from=333.5.b_6c6966655f66756e6e79.3#/all/click/0/1/2021-05-30,2021-06-06"
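        # Note: everything after "#" in the URL is a fragment, which is never sent to
        # the server, so every iteration of this loop requests the same page.
        for page_num in range(1, 3836):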
            url = "https://www.bilibili.com/v/life/funny/?spm_id_from=333.5.b_6c6966655f66756e6e79.3#/all/click/0/" + str(page_num) + "/2021-05-30,2021-06-06"
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        titles = response.xpath('//div[@class="r"]/a/text()').extract()
        introduces = response.xpath('//div[@class="v-desc"]/text()').extract()
        play_nums = response.xpath('//div[@class="v-info"]/span[@class="v-info-i"]/span[@class]/text()').extract()
        danmus = response.xpath('//div[@class="v-info"]/span[2][@class="v-info-i"]/span/text()').extract()
        stores = response.xpath('//div[@class="v-info"]/span[3][@class="v-info-i"]/span/text()').extract()
        up_names = response.xpath('//div[@class="up-info"]/a/text()').extract()
        dates = response.xpath('//div[@class="up-info"]/span[@class="v-date"]/text()').extract()
        length_times = response.xpath('//*[@id="videolist_box"]/div[2]/ul/li[1]/div[1]/div/a/div/span').extract()
        websites = response.xpath('//div[@class="up-info"]/a/@href').extract()

        for title,introduce,play_num,danmu,store,up_name,date,length_time,website in zip(titles,introduces,play_nums,danmus,stores,up_names,dates,length_times,websites):
            item = BilibiliVideoItem()
            item['title'] = title
            item['introduce'] = introduce
            item['play_num'] = play_num
            item['danmu'] = danmu
            item['store'] = store
            item['up_name'] = up_name
            item['date'] = date
            item['length_times'] = length_time
            item['website'] = website
            yield item

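        # Note: time.sleep() blocks Scrapy's event loop; the DOWNLOAD_DELAY setting
        # is the usual way to throttle requests.
        time.sleep(2)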

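Checking the referer is one of the common anti-scraping strategies.

The value of this header tells the server which page you navigated from.

For example, when I request Taobao reviews, the referer is the product detail page, which shows that I requested the reviews from that product's detail page. Without a referer, the server will not return the reviews.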
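As a rough illustration of the point above, the sketch below builds per-request headers that carry a Referer. `scrapy.Request` accepts `url`, `headers`, `cookies`, `callback` and `dont_filter` keyword arguments. The helper names (`build_headers`, `request_kwargs`), the user-agent string, and the choice of the bilibili home page as the referring page are assumptions made for illustration only.

```python
# Hypothetical helpers: build the keyword arguments for a request that
# carries a Referer header. Only the standard library is needed here.

def build_headers(referer_url):
    """Return a header dict that names the page we claim to come from."""
    return {
        # Placeholder user-agent string, not taken from the original post.
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": referer_url,
    }


def request_kwargs(url, referer_url, cookies=None):
    """Return keyword arguments suitable for scrapy.Request(**kwargs)."""
    kwargs = {
        "url": url,
        "headers": build_headers(referer_url),
        "dont_filter": True,
    }
    if cookies:
        # Cookies are passed separately from the headers.
        kwargs["cookies"] = dict(cookies)
    return kwargs


# Inside a spider's start_requests() one could then write:
#     yield scrapy.Request(callback=self.parse, **request_kwargs(url, referer))
example = request_kwargs(
    "https://www.bilibili.com/v/life/funny/",
    "https://www.bilibili.com/",  # assumed referring page
)
```

Which referring page a site actually checks for has to be read from the browser's network panel; the home page used here is only a guess.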
