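Python crawler returns no useful content

I built a simple web page collector with a Python crawler, but it does not return any useful content, and what it does return looks garbled.
Python crawler source code: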

import requests
url = "https://www.baidu.com/s?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}
kw = input('Enter what you want to search for: ')
password = {
    'wb': kw
}
resp = requests.get(url, params=password, headers=headers)
with open(kw+".html", "w", encoding="utf-8") as fp:
    fp.write(resp.text)
print("ok")

The crawler returns the following content:

<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <title>百度安全验证</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
    <meta name="format-detection" content="telephone=no, email=no">
    <link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">
    <link rel="icon" sizes="any" mask href="https://www.baidu.com/img/baidu.svg">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">
    <link rel="stylesheet" href="https://ppui-static-wap.cdn.bcebos.com/static/touch/css/api/mkdjump_aac6df1.css" />
</head>
<body>
    <div class="timeout hide-callback">
        <div class="timeout-img"></div>
        <div class="timeout-title">网络不给力,请稍后重试</div>
        <button type="button" class="timeout-button">返回首页</button>
    </div>
    <div class="timeout-feedback hide-callback">
        <div class="timeout-feedback-icon"></div>
        <p class="timeout-feedback-title">问题反馈</p>
    </div>

<script src="https://ppui-static-wap.cdn.bcebos.com/static/touch/js/mkdjump_v2_2d634b8.js"></script>
</body>
</html>

How can I solve this? Thanks.

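Please accept this answer if it helps.
Check whether the encoding of the resp returned by get() is UTF-8; if it is not, set it again through resp.encoding before reading resp.text.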
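The check described above can be sketched roughly as follows. This is only an illustration: the helper name choose_encoding is hypothetical, and a SimpleNamespace stands in for a real response so that the snippet runs without network access. It relies on the encoding and apparent_encoding attributes that a requests response object provides, and on the assumption that ISO-8859-1 is the fallback requests picks when a text response declares no charset.

```python
from types import SimpleNamespace


def choose_encoding(resp):
    """Switch to the detected encoding when none (or only the fallback) was declared.

    `resp` is any object exposing `encoding` and `apparent_encoding`,
    as a requests response does. Hypothetical helper, for illustration only.
    """
    declared = (resp.encoding or "").lower()
    if declared in ("", "iso-8859-1"):
        # No usable charset was declared, so trust the one detected from the body.
        resp.encoding = resp.apparent_encoding
    return resp.encoding


# Stand-in for a response whose header carried no charset.
fake_resp = SimpleNamespace(encoding="ISO-8859-1", apparent_encoding="utf-8")
print(choose_encoding(fake_resp))
```

With a real response, the call would go right after requests.get(...) and before the first access to resp.text.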


This answer quotes ChatGPT.

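Judging from the code and the returned result, the output is not really garbled: the returned HTML declares charset=utf-8 and its text is readable. Its title, 百度安全验证 ("Baidu Security Verification"), suggests that Baidu served its verification page instead of search results, which tends to happen when a request looks automated. Two things in the code are worth fixing. First, the query is sent as 'wb', whereas Baidu's search URL takes the keyword in the 'wd' parameter. Second, if a page ever does come back garbled, adding a Content-Type header to the request will not help, because that header describes the request body, not the response; how the response is decoded is controlled by resp.encoding. The modified code is as follows:

import requests

url = "https://www.baidu.com/s"
headers = {
    # Browser-like User-Agent; Baidu may still answer scripted requests with its verification page
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
}
kw = input('Enter what you want to search for: ')
# Baidu's search keyword parameter is 'wd' (the original code sent 'wb')
params = {
    'wd': kw
}
resp = requests.get(url, params=params, headers=headers, timeout=10)
# Decode the body with the encoding detected from its content
resp.encoding = resp.apparent_encoding
with open(kw + ".html", "w", encoding="utf-8") as fp:
    fp.write(resp.text)
print("ok")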
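To see what the params argument actually puts on the wire, the query string can be built by hand with the standard library. This is a small sketch, not what requests does internally; the function name build_search_url is hypothetical, and it assumes the search keyword belongs in Baidu's wd parameter.

```python
from urllib.parse import urlencode


def build_search_url(keyword, base="https://www.baidu.com/s"):
    """Return the search URL with the keyword percent-encoded in the 'wd' parameter."""
    return base + "?" + urlencode({"wd": keyword})


print(build_search_url("python"))
```

Printing resp.url after the request is a quick way to confirm that the keyword ended up under the intended parameter name.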

If the text is still garbled, print the encoding that requests chose for the response to see what is actually being used:


print(resp.encoding)

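Then set resp.encoding to the encoding the page actually uses before reading resp.text. The decoding is controlled by the response object, not by a Content-Type header on the request.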
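It may also help to check what kind of page came back before saving it. The page shown in the question is titled 百度安全验证 ("Baidu Security Verification") rather than a results page, so for that response the encoding was not the issue. The following is a minimal sketch with hypothetical helper names (page_title, is_verification_page); it only looks at the title text seen in the question and is not a general anti-bot detector.

```python
import re

# Title of the page returned in the question ("Baidu Security Verification").
VERIFICATION_TITLE = "百度安全验证"


def page_title(html):
    """Return the text inside the first <title> element, or '' if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def is_verification_page(html):
    """True when the page title matches the verification page seen in the question."""
    return VERIFICATION_TITLE in page_title(html)


sample = "<html><head><title>百度安全验证</title></head><body></body></html>"
print(is_verification_page(sample))
```

In the crawler, is_verification_page(resp.text) could be called before writing the file, so that a verification page is reported instead of being saved as if it were search results.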