Segment each text with the jieba library, extract the top 50 high-frequency words from each, and draw word clouds with pyecharts to compare them.
Count the number of danmaku each user sent, group the counts into four categories (1, 2, 3, and >3), and draw a pie chart of the distribution.
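The code below leans on calling `value_counts()` twice, which can read oddly at first. Here is a minimal toy sketch (the data is made up, purely for illustration) of what each call produces:

import pandas as pd

# Hypothetical toy data: user 'a' sent 3 danmaku, 'b' sent 2, 'c' and 'd' sent 1 each
toy = pd.DataFrame({'user_id': list('aaabbcd')})

per_user = toy['user_id'].value_counts()
print(per_user)    # a: 3, b: 2, c: 1, d: 1  (danmaku sent per user)

per_count = per_user.value_counts().sort_index()
print(per_count)   # 1: 2, 2: 1, 3: 1  (number of users per danmaku count)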
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Pie

# df is the danmaku DataFrame loaded earlier (assumed to have a 'user_id' column)
# Series: user id (index) -> number of danmaku sent (values)
series_user = df['user_id'].value_counts()
# Series: number of danmaku sent (index) -> number of users (values)
series_comment = series_user.value_counts()
# Sort by index in ascending order
series_comment = series_comment.sort_index(ascending=True)
# List of danmaku counts
comment_count_list = series_comment.index
# List of user counts
user_count_list = series_comment.values.tolist()
# Bucket the counts into 4 categories: 1, 2, 3, and >3
comment_count_list = [str(count) + '次' for count in comment_count_list[:3]] + ['>3次']
user_count_list = user_count_list[:3] + [sum(user_count_list[3:])]
chart = (
    Pie()
    .add(
        "",
        [list(z) for z in zip(comment_count_list, user_count_list)],
        center=["35%", "50%"],
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(title="弹幕发送次数分布饼图"),
        legend_opts=opts.LegendOpts(pos_left="80%", orient="vertical"),
    )
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    # .render("pie_position.html")
)
chart.render_notebook()
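A small optional tweak, not in the original: standard ECharts pie label formatters also support a `{d}` placeholder for the slice's percentage, so each label can show the share as well:

# '{d}' is the slice percentage in ECharts pie label formatters
chart.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c} ({d}%)"))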
As the chart shows, users who sent only one danmaku make up the large majority; 5222 users sent 2, 1681 sent 3, and 1873 sent more than 3. Next, let's look at the Top 10 users by danmaku count.
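The original post stops before the Top 10 step. A minimal sketch of what it might look like, reusing `series_user` from above and assuming a pyecharts Bar chart (the chart names and layout are my own choices, not from the source):

from pyecharts import options as opts
from pyecharts.charts import Bar

# Top 10 most active senders: user ids and their danmaku counts
top10 = series_user.head(10)

bar = (
    Bar()
    .add_xaxis([str(uid) for uid in top10.index])
    .add_yaxis("弹幕数", top10.values.tolist())
    .set_global_opts(title_opts=opts.TitleOpts(title="弹幕数量 Top10 用户"))
)
bar.render_notebook()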
The word-cloud comparison from the first question can be done with the following steps:
import jieba
from collections import Counter

text = '这是一段文本,用于测试分词和提取高频词汇的功能。'
words = jieba.lcut(text)                    # segment the text into words
word_counts = Counter(words)                # count word frequencies
top50_words = word_counts.most_common(50)   # top 50 (word, count) pairs
print(top50_words)
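One practical caveat the original answer skips: raw `jieba.lcut` output includes punctuation and filler words, which will dominate the top 50 on real data. A minimal sketch of filtering them out before counting (the stopword set here is a made-up placeholder; extend it for real data):

from collections import Counter

# Hypothetical minimal stopword/punctuation set, for illustration only
stopwords = {',', '。', '的', '是', '了'}
filtered = [w for w in words if w not in stopwords and w.strip()]
word_counts = Counter(filtered)
top50_words = word_counts.most_common(50)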
from pyecharts.charts import WordCloud
from pyecharts import options as opts

wordcloud = WordCloud()
wordcloud.add('词云图', top50_words, word_size_range=[20, 100])
wordcloud.set_global_opts(title_opts=opts.TitleOpts(title='词云图'))
wordcloud.render('wordcloud.html')
Note that `top50_words` is a list of tuples, each holding a word and its frequency. This is already the `data_pair` format that `WordCloud.add` expects, so it can be passed in directly. If you build a dict of word counts instead, convert it back to a list of pairs before adding:

word_dict = dict(top50_words)
wordcloud.add('词云图', list(word_dict.items()), word_size_range=[20, 100])
The complete code is as follows:
import jieba
from collections import Counter
from pyecharts.charts import WordCloud
from pyecharts import options as opts

text1 = '这是一段文本,用于测试分词和提取高频词汇的功能。'
text2 = '这是另外一段文本,用于测试分词和提取高频词汇的功能。'

words1 = jieba.lcut(text1)
words2 = jieba.lcut(text2)
word_counts1 = Counter(words1)
word_counts2 = Counter(words2)
top50_words1 = word_counts1.most_common(50)
top50_words2 = word_counts2.most_common(50)

wordcloud = WordCloud()
# note: both series render onto the same canvas; see the Page sketch below
# for a side-by-side layout
wordcloud.add('文本1', top50_words1, word_size_range=[20, 100])
wordcloud.add('文本2', top50_words2, word_size_range=[20, 100])
wordcloud.set_global_opts(title_opts=opts.TitleOpts(title='词云图'))
wordcloud.render('wordcloud.html')
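Two `add` calls draw both series on one canvas, which makes visual comparison awkward. A sketch of an alternative using pyecharts' `Page` container to render the two clouds as separate charts on one page (the `build_cloud` helper and file name are my own, not from the source):

from pyecharts.charts import Page, WordCloud
from pyecharts import options as opts

def build_cloud(name, pairs):
    # one WordCloud chart per text, so the clouds don't overlap
    wc = WordCloud()
    wc.add(name, pairs, word_size_range=[20, 100])
    wc.set_global_opts(title_opts=opts.TitleOpts(title=name))
    return wc

page = Page()
page.add(build_cloud('文本1', top50_words1), build_cloud('文本2', top50_words2))
page.render('wordcloud_compare.html')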