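Get a recent Chinese current-affairs text from the web, segment it into words, count the word frequencies, and visualize the high-frequency words as a word cloud with wordcloud.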
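import jieba
import wordcloud
from collections import Counter
# read the article text; the with block closes the file automatically
with open('素材.txt', encoding='utf-8') as f:
    text = f.read()
words = jieba.lcut(text)
# drop single-character tokens (mostly punctuation and particles)
words = [word for word in words if len(word) > 1]
# count tokens (not substring matches in the raw text)
counts = Counter(words)
# a font with Chinese glyphs is required; this path is Windows-specific
font = r'C:\Windows\Fonts\STZHONGS.TTF'
word_cloud = wordcloud.WordCloud(font_path=font, width=600, height=600)
word_cloud.generate_from_frequencies(counts)
word_cloud.to_file('词云.png')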
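The filtering and counting step can be tried on its own, without jieba or a font file. A minimal sketch, assuming the tokens are already available as a list; `count_tokens` and `min_len` are names made up for this illustration. It counts tokens rather than substring matches in the raw text, so a word is not inflated by longer words that contain it.

```python
from collections import Counter

def count_tokens(tokens, min_len=2):
    """Count tokens that have at least min_len characters."""
    return Counter(tok for tok in tokens if len(tok) >= min_len)

# hand-written tokens standing in for the output of jieba.lcut
tokens = ['经济', '的', '经济学', '经济', '增长']
counts = count_tokens(tokens)
print(counts.most_common(2))
```

Note that '经济学' is counted separately from '经济', which is what a per-token count should do.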
You can segment the text with jieba and then count word frequencies. The following code can serve as a reference:
import jieba
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
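text = "时事文章文本"  # placeholder: put the article text here
# segment the text with jieba
seg_list = jieba.cut(text)
# count word frequencies
word_counts = Counter(seg_list)
# build the word cloud (font_path must point to a font with Chinese glyphs)
wordcloud = WordCloud(font_path="your_font_path.ttf", width=800, height=400, background_color="white")
wordcloud.generate_from_frequencies(word_counts)
# display the word cloud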
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
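`Counter(seg_list)` counts every token jieba yields, punctuation and whitespace included, so those can end up in the cloud. One possible filter, sketched with the standard library only; `is_chinese_word` is a hypothetical helper, and the character range covers only the basic CJK Unified Ideographs block, so rarer characters, Latin words, and numbers are dropped as well.

```python
import re
from collections import Counter

# two or more characters from the basic CJK Unified Ideographs block
CHINESE_WORD = re.compile(r'[\u4e00-\u9fff]{2,}')

def is_chinese_word(token):
    """True if the token consists only of Chinese characters (length >= 2)."""
    return CHINESE_WORD.fullmatch(token) is not None

# hand-written tokens standing in for the output of jieba.cut
seg_list = ['今天', ',', '股市', ' ', '上涨', '了', '3%', '股市']
word_counts = Counter(tok for tok in seg_list if is_chinese_word(tok))
print(word_counts)
```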
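From the command line (cmd): pip install wordcloud
If the install fails with an error, it may be a Visual Studio (build tools) environment problem; download what is needed from the website named in the error message.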
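Suggestion:
You can use Python's jieba library for Chinese word segmentation and the Counter class from the collections module to count word frequencies. Then generate the word cloud with the wordcloud library, whose parameters control the style, colors, and so on. Finally, display the result with matplotlib and save it as an image. The steps are as follows: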
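1. Use the requests library to fetch a recent Chinese current-affairs text, for example from a news site:
import requests
url = 'https://new.qq.com/omn/20211017/20211017A03Y7N00.html'
res = requests.get(url)
# res.text is the page's raw HTML, so the markup still has to be stripped before segmentation
text = res.text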
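`res.text` holds the page's HTML rather than the article body, so tags and scripts would otherwise be segmented too. A rough sketch of stripping markup with the standard library's `html.parser`; `TextExtractor` and `extract_text` are illustrative names, and the sample HTML is made up. Real news pages, especially ones rendered by JavaScript, may need a dedicated parser or a different data source.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.parts)

sample = ('<html><head><style>p {color: red}</style></head>'
          '<body><p>今日新闻</p><script>var a = 1;</script>'
          '<p>经济增长</p></body></html>')
print(extract_text(sample))
```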
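2. Use jieba for Chinese word segmentation; the result can be saved to a file for later use:
import jieba
# assumption: stop_words is a set of words to discard; the original snippet does not define it
stop_words = set()
text_cut = jieba.cut(text)
text_cut_stop = [word for word in text_cut if word not in stop_words]
with open('words_cutstop.txt', 'w', encoding='utf-8') as file:
    file.write(' '.join(text_cut_stop))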
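The list comprehension in this step filters against `stop_words`. One common way to obtain such a set is a text file with one word per line. A minimal sketch under that assumption; the file name `stopwords.txt` and both helper names are made up for illustration, and the demo writes a throwaway file so that it runs as is.

```python
import os
import tempfile

def load_stop_words(path):
    """Read one stop word per line, ignoring blank lines."""
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def remove_stop_words(tokens, stop_words):
    """Keep the tokens that are not in the stop-word set, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]

# demo with a temporary file; a real list would be a file you maintain or download
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stopwords.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('的\n了\n\n我们\n')
    stop_words = load_stop_words(path)

tokens = ['我们', '的', '经济', '增长', '了']
print(remove_stop_words(tokens, stop_words))
```

Using a set keeps each membership test cheap, which matters once the article and the stop-word list grow.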
3. Use Counter from the collections module to count word frequencies:
from collections import Counter
word_counts = Counter(text_cut_stop)
top_words = word_counts.most_common(20)
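`most_common(20)` returns a list of `(word, count)` pairs, while `generate_from_frequencies` takes a mapping from word to frequency. If only the top words should be drawn, the list can be turned back into a dict. A small sketch with made-up counts; `top_n_frequencies` is a hypothetical helper.

```python
from collections import Counter

def top_n_frequencies(word_counts, n):
    """Return the n most frequent words as a plain dict."""
    return dict(word_counts.most_common(n))

# made-up counts for illustration
word_counts = Counter({'经济': 5, '增长': 3, '市场': 2, '政策': 1})
top_words = top_n_frequencies(word_counts, 2)
print(top_words)
```

The resulting dict can be passed to `generate_from_frequencies`; the `max_words` parameter of `WordCloud` is another way to cap the number of words.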
4. Generate the word cloud with the wordcloud library:
from wordcloud import WordCloud
wc = WordCloud(font_path='simhei.ttf', scale=4, background_color='white')
wc.generate_from_frequencies(word_counts)
5. Parameters control the style and colors of the word cloud, for example the background color, maximum number of words, font size, and color scheme:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import ImageColorGenerator
# load the mask image once as an array
background_image = np.array(Image.open('bg.jpg').convert('RGBA'))
image_colors = ImageColorGenerator(background_image)
wc = WordCloud(background_color='white', max_words=100, mask=background_image, max_font_size=150, random_state=42, font_path='simhei.ttf', color_func=image_colors)
# the new WordCloud object has to be filled before it can be drawn
wc.generate_from_frequencies(word_counts)
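For the `mask` argument, wordcloud treats pure white (255) pixels as masked out and draws words everywhere else. When no `bg.jpg` is at hand, a mask can be built directly as an array. A sketch that makes a circular mask with NumPy; `circular_mask` is an illustrative name, and `ImageColorGenerator` is left out here because a generated mask carries no colors to sample.

```python
import numpy as np

def circular_mask(size):
    """Return a size x size uint8 array: 0 inside the circle, 255 outside."""
    y, x = np.ogrid[:size, :size]
    center = (size - 1) / 2
    radius = size / 2
    # True for pixels outside the circle; those become white (masked out)
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)

mask = circular_mask(400)
print(mask.shape, mask[200, 200], mask[0, 0])
```

Passing `mask=mask` to `WordCloud` then confines the words to the circle.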
6. Display the result with matplotlib and save it as an image:
plt.imshow(wc)
plt.axis('off')
plt.show()
wc.to_file('output.png')
That produces a word-cloud image of the Chinese current-affairs text.