Solution approach:
1. Use requests.get(search URL + keyword) to fetch the search results page, get each result page in turn, parse out the text with bs4, and save it to Excel (a minimal sketch follows this list).
2. Read the text back from Excel, strip punctuation with re.sub(r"[^\w]+", " ", s), then segment it with jieba: words = [x for x in jieba.cut(s) if x != ' '] to get the word list.
3. Add the overly common function words to a stop_words list and filter them out, then write the 10 most frequent remaining words and their counts to a text file.
4. Draw the word cloud with wordcloud (steps 2-4 are covered by the second sketch below).
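A minimal sketch of step 1. The search URL, the "q" parameter, and the paragraph selector are assumptions standing in for whatever the target site actually uses:

import requests
import openpyxl
from bs4 import BeautifulSoup

keyword = "example"                            # hypothetical search keyword
search_url = "https://example.com/search"      # hypothetical search endpoint
resp = requests.get(search_url, params={"q": keyword}, timeout=10)
resp.raise_for_status()

# Pull the visible paragraph text; a real page needs site-specific selectors.
soup = BeautifulSoup(resp.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "scraped text"
for text in paragraphs:
    ws.append([text])                          # one paragraph per row
wb.save("scraped.xlsx")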
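And a sketch of steps 2-4, assuming the text sits in column A of the scraped.xlsx produced above; the stop_words set and the simhei.ttf font path are placeholders to adjust locally:

import re
from collections import Counter

import jieba
import openpyxl
from wordcloud import WordCloud

wb = openpyxl.load_workbook("scraped.xlsx")
ws = wb.active
s = " ".join(str(row[0]) for row in ws.iter_rows(values_only=True) if row[0])

s = re.sub(r"[^\w]+", " ", s)                   # step 2: drop punctuation
words = [x for x in jieba.cut(s) if x.strip()]  # step 2: segment, drop blanks

stop_words = {"的", "了", "是"}                  # step 3: extend after a first run
words = [w for w in words if w not in stop_words]

top10 = Counter(words).most_common(10)          # step 3: 10 most frequent words
with open("top10.txt", "w", encoding="utf-8") as f:
    for word, count in top10:
        f.write(f"{word}\t{count}\n")

# Step 4: font_path must point to a font with CJK glyphs (simhei.ttf on Windows).
wc = WordCloud(font_path="simhei.ttf", width=800, height=600,
               background_color="white")
wc.generate(" ".join(words))
wc.to_file("wordcloud.png")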
import requests
import openpyxl

# Tencent's COVID-19 news API; the modules query string selects which
# data series the endpoint returns.
url = "https://api.inews.qq.com/newsqa/v1/query/inner/publish/modules/list?modules=chinaDayList,chinaDayAddList,nowConfirmStatis,provinceCompare"
headers = {
    'Host': 'api.inews.qq.com',
    'Origin': 'https://news.qq.com',
    'Referer': 'https://news.qq.com/zt2020/page/feiyan.htm',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4098.3 Safari/537.36',
}
response = requests.post(url=url, headers=headers)
data = response.json()['data']['chinaDayAddList']   # daily-increment records

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Guangdong epidemic"
# Note: deadRate and healRate are percentage rates, not daily counts,
# so the last two headers are labeled accordingly.
ws.append(['year', 'date', 'cumulative confirmed', 'new confirmed',
           'total healed', 'total dead', 'death rate', 'heal rate'])
for each in data:   # one row per day
    ws.append([each['y'], each['date'], each['confirm'],
               each['localConfirmadd'], each['heal'], each['dead'],
               each['deadRate'], each['healRate']])
wb.save(r"d:\gddd.xlsx")