scrapy_splash

Problem: when I run a crawler that combines the Scrapy framework with Splash, I get the output below, and I don't understand it.
The output is:



ScrapyDeprecationWarning: Call to deprecated function scrapy.utils.request.request_fingerprint().

If you are using this function in a Scrapy component, and you are OK with users of your component changing the fingerprinting algorithm through settings, use crawler.request_fingerprinter.fingerprint() instead in your Scrapy component (you can get the crawler object from the 'from_crawler' class method).

Otherwise, consider using the scrapy.utils.request.fingerprint() function instead.

Either way, the resulting fingerprints will be returned as bytes, not as a string, and they will also be different from those generated by 'request_fingerprint()'. Before you switch, make sure that you understand the consequences of this (e.g. cache invalidation) and are OK with them; otherwise, consider implementing your own function which returns the same fingerprints as the deprecated 'request_fingerprint()' function.
  fp = request_fingerprint(request, include_headers=include_headers)

2023-08-17 11:34:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://search.jd.com/Search?keyword=iphone via http://192.168.99.101:8050/execute> (referer: None)

The spider code is as follows:

import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(args.wait)
  splash:runjs("document.getElementsByClassName('page clearfix')[0].scrollIntoView(true)")
  splash:wait(args.wait)
  return splash:html()
end
"""



class JdSpider(scrapy.Spider):
    name = "jd"
    # allowed_domains = ["search.jd.com"]
    # start_urls = ["https://search.jd.com/Search?keyword=iphone"]
    url = 'https://search.jd.com/Search?keyword=iphone'
    # Override the start_requests method
    def start_requests(self):
        yield SplashRequest(self.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source':lua_script,
                                  'images':0,
                                  'wait':5},
                            cache_args=['lua_source'])

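The message is telling you that this function is deprecated. In Scrapy, switch to crawler.request_fingerprinter.fingerprint() to compute the fingerprint; the other option is scrapy.utils.request.fingerprint().
For the implementation details, search the web yourself.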


<html><head><title>Failed loading page</title></head>
                <body>
                    <h1>Failed loading page (Host search.jd.com not found)</h1>
                    <h2>https://search.jd.com/Search?keyword=iphone</h2>
                    <p>Network error #3</p>
                </body></html>
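
The HTML above is not a JD search results page. It appears to be an error page returned in place of the real content, reporting that the host search.jd.com could not be found, even though Scrapy logged the request as Crawled (200). Below is a minimal sketch of how a callback could detect this case before parsing. It assumes the error page always carries the "Failed loading page" title seen above, and the helper name is hypothetical:

```python
def looks_like_failed_load(html: str) -> bool:
    """Return True if the body looks like the 'Failed loading page' error page."""
    # Assumption: the marker text matches the title of the error page shown above.
    return "<title>Failed loading page</title>" in html


error_page = (
    "<html><head><title>Failed loading page</title></head>"
    "<body><h1>Failed loading page (Host search.jd.com not found)</h1></body></html>"
)
print(looks_like_failed_load(error_page))                      # True
print(looks_like_failed_load("<html><body>ok</body></html>"))  # False
```

In a spider, such a check could run on response.text at the top of parse() so that error pages are logged instead of silently yielding no items.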

[The following answer was generated by GPT]

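According to the message, a Scrapy function has been deprecated and should no longer be used. The fix is to replace the deprecated function with its successor.

The steps are:

Replace the call to request_fingerprint() with the new fingerprint() function. Note that fingerprint() returns bytes rather than a string (str), so adjust any code that uses the result accordingly.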

from scrapy.utils.request import fingerprint

# Replace this line
fp = request_fingerprint(request, include_headers=include_headers)

# with this one
fp = fingerprint(request, include_headers=include_headers)

If this code lives in a Scrapy component, use crawler.request_fingerprinter.fingerprint() instead; it also returns bytes rather than a string (str).

# Replace this line
fp = request_fingerprint(request, include_headers=include_headers)

# with this one (the crawler object is available from the from_crawler class method);
# fingerprint() takes only the request
fp = crawler.request_fingerprinter.fingerprint(request)

Make sure you understand the consequences of switching functions, such as cache invalidation, and confirm that they are acceptable to you.
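
The component case can be sketched without a running crawler. The example below is a minimal illustration of the from_crawler pattern that the warning text refers to, not Scrapy's own code: SeenRequests and StubFingerprinter are hypothetical names, and the crawler and request objects are stand-ins built with SimpleNamespace so that the snippet runs on its own.

```python
from types import SimpleNamespace


class SeenRequests:
    """Hypothetical component that tracks which requests it has already seen."""

    def __init__(self, fingerprinter):
        self.fingerprinter = fingerprinter
        self.seen = set()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy passes the crawler here; keep its configured fingerprinter.
        return cls(crawler.request_fingerprinter)

    def is_new(self, request) -> bool:
        fp = self.fingerprinter.fingerprint(request)  # bytes, not str
        if fp in self.seen:
            return False
        self.seen.add(fp)
        return True


class StubFingerprinter:
    """Stand-in for crawler.request_fingerprinter, for demonstration only."""

    def fingerprint(self, request) -> bytes:
        return f"{request.method} {request.url}".encode()


# Stand-ins for the real crawler and request objects.
crawler = SimpleNamespace(request_fingerprinter=StubFingerprinter())
tracker = SeenRequests.from_crawler(crawler)
request = SimpleNamespace(method="GET", url="https://search.jd.com/Search?keyword=iphone")
print(tracker.is_new(request))  # True
print(tracker.is_new(request))  # False
```

Because the fingerprints are bytes, they can go straight into a set or be used as dictionary keys; anything previously stored under the old string fingerprints will no longer match.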

Here is the modified spider:

import scrapy
from scrapy.utils.request import fingerprint

lua_script = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(args.wait)
  splash:runjs("document.getElementsByClassName('page clearfix')[0].scrollIntoView(true)")
  splash:wait(args.wait)
  return splash:html()
end
"""

class JdSpider(scrapy.Spider):
    name = "jd"
    # allowed_domains = ["search.jd.com"]
    # start_urls = ["https://search.jd.com/Search?keyword=iphone"]
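    url = 'https://search.jd.com/Search?keyword=iphone'

    def parse(self, response):
        # Use fingerprint() in place of the deprecated request_fingerprint().
        # include_headers expects a list of header names, so it is omitted here
        # and headers are ignored. The result is bytes.
        fp = fingerprint(response.request)
        # Other processing code
        ...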

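Note that this is only one way to address the problem, and the exact fix may differ depending on your Scrapy version. I recommend consulting the official Scrapy and Splash documentation and community support for more on this issue.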


It looks like the function has been deprecated.

Written with reference to GPT-4.0 and ERNIE Bot; if it helps, please accept the answer.

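The scrapy.utils.request.request_fingerprint() function you are using has been deprecated.
Scrapy recommends using crawler.request_fingerprinter.fingerprint() instead.
crawler.request_fingerprinter.fingerprint() is the newer method for generating a request's fingerprint. It returns bytes rather than a string, and its fingerprints differ from those produced by request_fingerprint().
Based on your code, you can override the start_requests() method in the JdSpider class and replace request_fingerprint() with crawler.request_fingerprinter.fingerprint().
Here is a modified example: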

import scrapy
from scrapy_splash import SplashRequest

class JdSpider(scrapy.Spider):
    name = "jd"
    start_urls = ["https://search.jd.com/Search?keyword=iphone"]

    def start_requests(self):
        # lua_script is the module-level Lua source from the question.
        # SplashRequest has no `fingerprint` argument, so none is passed here.
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': lua_script,
                                  'images': 0,
                                  'wait': 0.5})

    def fingerprint_request(self, request):
        # Use the crawler's configured fingerprinter; the result is bytes.
        return self.crawler.request_fingerprinter.fingerprint(request)


import scrapy
from scrapy.utils.request import fingerprint
from scrapy_splash import SplashRequest

class JdSpider(scrapy.Spider):
    name = "jd"
    # ...

    def start_requests(self):
        # fingerprint() expects a Request object, not a URL string, and returns bytes.
        fp = fingerprint(scrapy.Request(self.url))
        yield SplashRequest(self.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source':lua_script,
                                  'images':0,
                                  'wait':5},
                            cache_args=['lua_source'],
                            meta={'fingerprint': fp})

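This calls a deprecated function, request_fingerprint(), which generates a request's fingerprint, that is, a string that uniquely identifies the request.
I recommend the scrapy.utils.request.fingerprint() function mentioned in the first answer as a replacement.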
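
To make the idea of a request fingerprint concrete, here is a toy sketch using only the standard library. It is not Scrapy's algorithm (toy_fingerprint is a hypothetical helper); it only shows the general principle of hashing the parts that identify a request so that equivalent requests map to the same bytes value.

```python
import hashlib


def toy_fingerprint(method: str, url: str, body: bytes = b"") -> bytes:
    """Hash the parts that identify a request. NOT Scrapy's algorithm."""
    hasher = hashlib.sha1()
    hasher.update(method.upper().encode())
    hasher.update(b"\n")
    hasher.update(url.encode())
    hasher.update(b"\n")
    hasher.update(body)
    return hasher.digest()


a = toy_fingerprint("GET", "https://search.jd.com/Search?keyword=iphone")
b = toy_fingerprint("get", "https://search.jd.com/Search?keyword=iphone")
c = toy_fingerprint("GET", "https://search.jd.com/Search?keyword=ipad")
print(a == b)  # True: same method (case-insensitive), URL, and body
print(a == c)  # False: different URL
print(len(a))  # 20 (a SHA-1 digest is 20 bytes)
```

A duplicate filter or cache can then compare these values instead of comparing whole requests.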

See "Scrapy Crawled (200) ＜GET http://www.baidu.com/＞ (referer: None): error and solution" on Znovko's CSDN blog; it may be a useful reference.

Summary of the post: the error occurs in a project already set up with the Scrapy framework. Open settings.py, find line 40, uncomment the two lines that are commented out by default, and add a "user-agent" field (the value can be taken from a browser). Run the project again and the page source is retrieved.

https://blog.csdn.net/weixin_55109596/article/details/123736893
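
A sketch of what that settings.py change might look like. It assumes the two commented-out lines the post refers to belong to the DEFAULT_REQUEST_HEADERS block of a generated Scrapy project (the post's text does not name the setting), and the User-Agent value is a placeholder to replace with the string your own browser sends:

```python
# settings.py (fragment)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    # Placeholder value: copy the real string from your browser's developer tools.
    "User-Agent": "Mozilla/5.0 (placeholder; replace with your browser's User-Agent)",
}
```

Whether this helps with the JD spider in the question is uncertain, since the page returned there reports a host lookup failure rather than a blocked request.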

The request_fingerprint() function has been marked as deprecated; Scrapy recommends using crawler.request_fingerprinter.fingerprint() or scrapy.utils.request.fingerprint() instead.

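This warning appears because Scrapy's request_fingerprint function has been deprecated, so you see it whenever the crawler runs. Scrapy recommends crawler.request_fingerprinter.fingerprint() as the replacement. The warning should not affect your crawl results, though; the problem is probably somewhere else.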
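
Note that the spider in the question never calls request_fingerprint() itself, which suggests the call comes from library code (likely scrapy-splash) rather than from the spider. If the warning is only noise, one option is to filter it with the standard warnings module. The sketch below assumes the message prefix matches the text quoted in the question, and uses a stand-in DeprecationWarning to demonstrate the filter:

```python
import warnings

# Assumption: this pattern matches the start of the warning text quoted in the question.
PATTERN = r"Call to deprecated function scrapy\.utils\.request\.request_fingerprint"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                        # record every warning...
    warnings.filterwarnings("ignore", message=PATTERN)     # ...except the matching one
    warnings.warn(
        "Call to deprecated function scrapy.utils.request.request_fingerprint().",
        DeprecationWarning,
    )
    warnings.warn("some other deprecation", DeprecationWarning)

print(len(caught))             # 1
print(str(caught[0].message))  # some other deprecation
```

In a real project, the filterwarnings call alone would go somewhere that runs before the crawl starts, for example near the top of settings.py.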

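Based on GPT:
This warning appears because the scrapy.utils.request.request_fingerprint() function from the Scrapy framework is being called, and that function has been deprecated. The warning suggests using crawler.request_fingerprinter.fingerprint() instead.

Following the warning, you can try replacing this code:

fp = request_fingerprint(request, include_headers=include_headers)

with:

crawler = self.crawler
fp = crawler.request_fingerprinter.fingerprint(request)

This uses the new method to compute the request fingerprint.

Note that the new method returns the fingerprint as bytes rather than as a string; if your code needs a string fingerprint, adjust it accordingly.

The warning also mentions considerations such as cache invalidation. If your code uses a cache, make sure you understand these changes and adapt to them.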
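
If existing code expects a string, a bytes fingerprint can be converted with the built-in bytes.hex() method. A minimal sketch follows; the bytes value here is a stand-in SHA-1 digest chosen for illustration, not a real Scrapy fingerprint:

```python
import hashlib

# Stand-in for a fingerprint returned as bytes (a SHA-1 digest, as an assumption).
fp_bytes = hashlib.sha1(b"GET https://search.jd.com/Search?keyword=iphone").digest()

fp_str = fp_bytes.hex()             # bytes -> hexadecimal str
round_trip = bytes.fromhex(fp_str)  # str -> bytes

print(type(fp_bytes).__name__, type(fp_str).__name__)  # bytes str
print(len(fp_str))                                     # 40
print(round_trip == fp_bytes)                          # True
```

The resulting string will still differ from what request_fingerprint() used to return, so anything keyed on the old values (such as an HTTP cache) will not match.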