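I'm using Python to scrape a single page from 51job. The first few requests returned the page fine, but after that every request fails with 'gbk' codec can't decode byte 0x8b in position 1: illegal multibyte sequence. I've tried many of the fixes suggested online and none of them worked.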
import json
import urllib.request,urllib.error
import re
import random
import gzip
from io import BytesIO
import zlib

def main():
    url = "https://search.51job.com/list/020000,000000,0000,00,9,99,python,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
    askURL(url)

def askURL(url):
    head = {
        "User-Agent": "Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 80.0.3987.122 Safari / 537.36",
    }
    request = urllib.request.Request(url,headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("gbk")
        print(html)
    except urllib.error.URLError as e:
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
    return html

if __name__ == "__main__":
    main()
Traceback (most recent call last):
  File "D:\demo\shixun\demo\test51job.py", line 37, in <module>
    main()
  File "D:\demo\shixun\demo\test51job.py", line 12, in main
    askURL(url)
  File "D:\demo\shixun\demo\test51job.py", line 25, in askURL
    html = response.read().decode("gbk")
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8b in position 1: illegal multibyte sequence
[Finished in 1.6s]
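I've tried pretty much every suggestion in the posts I could find online; each one either raises an error or prints garbled text!
Could someone please take a look and tell me what the problem is?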
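Indeed, it fails on the second request, and I can't open this URL in my browser either.
This may be an anti-scraping mechanism; it looks like even a browser can't refresh this URL twice.
Consider starting from https://search.51job.com/ and POSTing the relevant keywords to run the query.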
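A rough sketch of that idea, using only the standard library: keep cookies in one opener, visit the search home page first, then POST the keyword. The form field name "keyword", the helper names make_session and build_search_request, and the choice to POST to the home page URL are all placeholders I made up for illustration; the real endpoint and field names have to be read from the site's own search form.

```python
import http.cookiejar
import urllib.parse
import urllib.request

SEARCH_HOME = "https://search.51job.com/"


def make_session():
    """Return an opener that stores cookies and sends them on later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_search_request(url, fields, user_agent):
    """Build a POST request; passing data= is what makes urllib use POST."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=data, headers={"User-Agent": user_agent})


# Intended use (needs network access, so it is left commented out):
# opener = make_session()
# opener.open(SEARCH_HOME)  # first visit, picks up any cookies the site sets
# request = build_search_request(
#     SEARCH_HOME,
#     {"keyword": "python"},  # hypothetical field name
#     "Mozilla/5.0",
# )
# response = opener.open(request)
```

Whether this gets past the site's checks is not something I can confirm; it only shows how to keep one session across requests instead of hitting the result URL cold.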
I didn't find any problem.
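One more thing worth checking: a gzip stream always starts with the bytes 0x1f 0x8b, so "byte 0x8b in position 1" suggests the body may be arriving gzip-compressed and being passed straight to decode("gbk"). That is a guess from the error message, not something confirmed against the site. A minimal sketch, assuming the page itself is GBK-encoded as in the question; decode_body is a helper name invented here:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, content_encoding="", charset="gbk"):
    """Decompress the body if it is gzip data, then decode it to text."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip" or raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # errors="replace" keeps going if a few stray bytes are not valid in charset
    return raw.decode(charset, errors="replace")


# Intended use inside askURL (needs network access, so it is left commented out):
# response = urllib.request.urlopen(request)
# html = decode_body(response.read(), response.headers.get("Content-Encoding", ""))
```

If the output is still garbled after decompression, the next thing to look at would be the charset declared in the response headers or in the page's meta tag.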