I'm about to graduate and want to build a crawler to analyze job listings.
It can only scrape 3 pages at a time, and I have to wait over half an hour between runs or the requests start failing.
I searched online for a while; some solutions say to change the decoding, others say to decompress with gzip.
I tried both, but neither works.
Code 1:
import urllib.request

def askUrl(url):
    head = {
        'Host': 'search.51job.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.50',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    }
    request = urllib.request.Request(url, headers=head)
    response = urllib.request.urlopen(request)
    bs = response.read().decode('utf-8')  # this line raises error 1
    # bs = gzip.decompress(bs).decode('utf-8')
    print(bs)
    return bs
Code 2:
import gzip
import urllib.request

def askUrl(url):
    head = {
        'Host': 'search.51job.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.50',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    }
    request = urllib.request.Request(url, headers=head)
    response = urllib.request.urlopen(request)
    bs = response.read()
    bs = gzip.decompress(bs).decode('utf-8')  # this line raises error 2
    print(bs)
    return bs
Error 1: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Error 2: OSError: Not a gzipped file (b'\n')
I have working code for a job-listing crawler, want it?
You need to check which compression format the response actually uses; you can't just apply gzip unconditionally.
The server may answer with gzip, with deflate, or with no compression at all.
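A minimal sketch of that idea: read the `Content-Encoding` response header and pick the matching decompressor, rather than assuming one format. The helper name `decode_body` is ours, not from the original code, and `br` is dropped from `Accept-Encoding` because the standard library has no brotli decompressor.

```python
import gzip
import zlib
import urllib.request

def decode_body(raw, encoding):
    """Decompress a response body according to its Content-Encoding."""
    if encoding == 'gzip':
        raw = gzip.decompress(raw)
    elif encoding == 'deflate':
        try:
            # most servers send zlib-wrapped deflate
            raw = zlib.decompress(raw)
        except zlib.error:
            # some send raw deflate without the zlib header
            raw = zlib.decompress(raw, -zlib.MAX_WBITS)
    # '' or 'identity': body is not compressed, nothing to do
    return raw.decode('utf-8')

def askUrl(url):
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        # only advertise encodings we can actually decode (no 'br')
        'Accept-Encoding': 'gzip, deflate',
    }
    request = urllib.request.Request(url, headers=head)
    response = urllib.request.urlopen(request)
    encoding = response.headers.get('Content-Encoding', '')
    return decode_body(response.read(), encoding)
```

This fixes both errors at once: error 1 came from decoding gzip bytes (`0x1f 0x8b ...`) as UTF-8, and error 2 from gzip-decompressing a body that wasn't gzipped.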