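After installing the PyQuery library I started practicing web scraping. At first I loaded the HTML from a string, which ran fine and returned the data I wanted. But when I had pyquery load the page from a file instead, I ran into an encoding problem. The error message is: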
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 84: illegal multibyte sequence
import requests
from fake_useragent import UserAgent
from pyquery import PyQuery as pq
url = 'https://www.qidian.com/finish/'
headers = {'User-Agent':UserAgent().chrome}
resp = requests.get(url,headers=headers)
with open('tmp04.html','w',encoding='utf-8') as f:
    f.write(resp.text)
# Initialize the pyquery object
# doc = pq(resp.text)
doc = pq(filename='tmp04.html',encoding='utf-8')
# Extract the data
all_a = doc('div.book-mid-info > h2 > a')
for a in all_a:
    print(a.text)
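Originally I did not pass encoding at all. After hitting the error I tried adding encoding='utf-8', 'gbk', and 'gb2312', and I also tried putting # -*- coding: UTF-8 -*- on the first line. None of it helped; even the error message stayed exactly the same.
Any help would be appreciated!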
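Just remove the encoding='utf-8' argument from with open('tmp04.html','w',encoding='utf-8') as f: and it will work. The file is then written in the system default encoding (gbk here, judging by the error message), which is the same encoding being used to read it back.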
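To illustrate why the suggestion above can work, here is a minimal standard-library sketch of the mismatch. It assumes (this is not confirmed in the thread) that the installed pyquery version opens the file given by filename with the platform default encoding, gbk on a Chinese-locale Windows, regardless of the encoding argument; that would be consistent with the error message never changing. The helper read_text and the file name demo.html are made up for the illustration.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file using an explicit encoding (illustrative helper)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


html = "<h2>中</h2>"
path = os.path.join(tempfile.mkdtemp(), "demo.html")

# Write the file as UTF-8, as the code in the question does.
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# Reading with the same encoding round-trips the text.
same = read_text(path, "utf-8")

# Reading the same bytes as GBK raises UnicodeDecodeError for this input
# (other inputs may decode "successfully" into garbage instead).
try:
    read_text(path, "gbk")
    gbk_failed = False
except UnicodeDecodeError:
    gbk_failed = True

print(same == html, gbk_failed)
```

If that assumption holds, another option is to keep writing the file as UTF-8, read it back yourself with encoding='utf-8', and pass the resulting string to pq(), the same way the commented-out pq(resp.text) line in the question does.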
resp.encoding = resp.apparent_encoding
Try adding this line (before resp.text is used).
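Some context on that line: requests decodes resp.text using resp.encoding, which it takes from the Content-Type header and which falls back to ISO-8859-1 for text responses that declare no charset, whereas resp.apparent_encoding is guessed from the body itself. Note that this only affects how the response is decoded, so it mainly helps when the saved text itself comes out garbled. A rough sketch of where the line would go, with save_decoded_html being a hypothetical helper name and not part of requests:

```python
def save_decoded_html(resp, path):
    """Save resp.text to `path` as UTF-8 after re-detecting the body encoding.

    `resp` is assumed to be a requests.Response, or any object exposing the
    same `encoding`, `apparent_encoding`, and `text` attributes.
    """
    # Switch from the header-derived encoding to the one detected from the body.
    # This must happen before resp.text is read.
    resp.encoding = resp.apparent_encoding
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return resp.encoding


# Usage with the code from the question (needs network access):
#   resp = requests.get(url, headers=headers)
#   save_decoded_html(resp, 'tmp04.html')
```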