Solution approach:
1. Use requests.get(search URL + keyword) to fetch the search results page, get each result page in turn, parse out the text with bs4, and save it to Excel (a minimal sketch follows this list).
2. Read the text back from Excel, strip punctuation with re.sub(r"[^\w]+", " ", s), then segment it with jieba: words = [x for x in jieba.cut(s) if x != ' '] to get the word list.
3. Add the overly common function words to a stop_words list and filter them out, then write the 10 most frequent remaining words and their counts to a text file.
4. Draw the word cloud with wordcloud (steps 2-4 are covered by the second sketch below).
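A minimal sketch of step 1. The search URL, the "q" parameter, and the paragraph selector are assumptions standing in for whatever the target site actually uses:

import requests
import openpyxl
from bs4 import BeautifulSoup

keyword = "example"                            # hypothetical search keyword
search_url = "https://example.com/search"      # hypothetical search endpoint
resp = requests.get(search_url, params={"q": keyword}, timeout=10)
resp.raise_for_status()

# Pull the visible paragraph text; a real page needs site-specific selectors.
soup = BeautifulSoup(resp.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "scraped text"
for text in paragraphs:
    ws.append([text])                          # one paragraph per row
wb.save("scraped.xlsx")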
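And a sketch of steps 2-4, assuming the text sits in column A of the scraped.xlsx produced above; the stop_words set and the simhei.ttf font path are placeholders to adjust locally:

import re
from collections import Counter

import jieba
import openpyxl
from wordcloud import WordCloud

wb = openpyxl.load_workbook("scraped.xlsx")
ws = wb.active
s = " ".join(str(row[0]) for row in ws.iter_rows(values_only=True) if row[0])

s = re.sub(r"[^\w]+", " ", s)                   # step 2: drop punctuation
words = [x for x in jieba.cut(s) if x.strip()]  # step 2: segment, drop blanks

stop_words = {"的", "了", "是"}                  # step 3: extend after a first run
words = [w for w in words if w not in stop_words]

top10 = Counter(words).most_common(10)          # step 3: 10 most frequent words
with open("top10.txt", "w", encoding="utf-8") as f:
    for word, count in top10:
        f.write(f"{word}\t{count}\n")

# Step 4: font_path must point to a font with CJK glyphs (simhei.ttf on Windows).
wc = WordCloud(font_path="simhei.ttf", width=800, height=600,
               background_color="white")
wc.generate(" ".join(words))
wc.to_file("wordcloud.png")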
import requests
import openpyxl

# Tencent's COVID-19 news API; the modules query string selects which
# data series the endpoint returns.
url = "https://api.inews.qq.com/newsqa/v1/query/inner/publish/modules/list?modules=chinaDayList,chinaDayAddList,nowConfirmStatis,provinceCompare"
headers = {
    'Host': 'api.inews.qq.com',
    'Origin': 'https://news.qq.com',
    'Referer': 'https://news.qq.com/zt2020/page/feiyan.htm',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.4098.3 Safari/537.36',
}
response = requests.post(url=url, headers=headers)
data = response.json()['data']['chinaDayAddList']   # daily-increment records

wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Guangdong epidemic"
# Note: deadRate and healRate are percentage rates, not daily counts,
# so the last two headers are labeled accordingly.
ws.append(['year', 'date', 'cumulative confirmed', 'new confirmed',
           'total healed', 'total dead', 'death rate', 'heal rate'])
for each in data:   # one row per day
    ws.append([each['y'], each['date'], each['confirm'],
               each['localConfirmadd'], each['heal'], each['dead'],
               each['deadRate'], each['healRate']])
wb.save(r"d:\gddd.xlsx")