https://github.com/MIND-Lab/OCTIS
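I'm using the code from this repository.
I don't understand what the words column means.
I want to analyze the model_output it produces, but the results leave me completely confused.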
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence
# Load a preprocessed dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")
model = LDA(num_topics=25) # Create model
model_output = model.train_model(dataset) # Train the model
# Evaluate a model
metric = TopicDiversity(topk=10) # Initialize metric
topic_diversity_score = metric.score(model_output) # Compute score of the metric
metric1 = Coherence()
topic_coherence_score = metric1.score(model_output)
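So the things I don't understand right now are:
What does words mean in the dataset?
Why is the document dimension of the topic-document-matrix not equal to the number of documents?
Could anyone who works on text analysis and knows about this help me out?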
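From the output of the code below you can see that words is the vocabulary size, i.e. the number of distinct words (possibly counted after stop words have been removed).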
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('BBC_news')
len(dataset.get_corpus())
>>> 2225
len(dataset.get_vocabulary())
>>> 2949
print(dataset.get_corpus()[0:5])
>>> [['broadband', 'ahead', 'join', 'internet', 'fast', 'accord', 'official', 'figure', 'number', 'business', 'connect', 'jump', 'report', 'broadband', 'connection', 'end', 'compare', 'nation', 'rank', 'world', 'telecom', 'body', 'election', 'campaign', 'ensure', 'affordable', 'high', 'speed', 'net', 'access', 'american', 'accord', 'report', 'broadband', 'increasingly', 'popular', 'research', 'shopping', 'download', 'music', 'watch', 'video', 'total', 'number', 'business', 'broadband', 'rise', 'end', 'compare', 'hook', 'broadband', 'subscriber', 'line', 'technology', 'ordinary', 'phone', 'line', 'support', 'high', 'data', 'speed', 'cable', 'lead', 'account', 'line', 'broadband', 'phone', 'line', 'connection', 'accord', 'figure'],
['plan', 'share', 'sale', 'owner', 'technology', 'dominate', 'index', 'plan', 'sell', 'share', 'public', 'list', 'market', 'operate', 'accord', 'document', 'file', 'stock', 'market', 'plan', 'raise', 'sale', 'observer', 'step', 'close', 'full', 'public', 'icon', 'technology', 'boom', 'recently', 'pour', 'cold', 'water', 'suggestion', 'company', 'sell', 'share', 'private', 'technically', 'public', 'stock', 'start', 'trade', 'list', 'equity', 'trade', 'money', 'sale', 'investor', 'buy', 'share', 'private', 'filing', 'document', 'share', 'technology', 'firm', 'company', 'high', 'growth', 'potential', 'symbol', 'internet', 'telecom', 'boom', 'bubble', 'burst', 'recovery', 'fortune', 'tech', 'giant', 'dot', 'revive', 'fortune']]
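To make the difference between the corpus and the vocabulary concrete, here is a minimal sketch in plain Python. It uses a made-up toy corpus instead of OCTIS itself, and `build_vocabulary` is a hypothetical helper, not an OCTIS function. It assumes the vocabulary is simply the set of distinct tokens left after preprocessing; the real pipeline may filter further (for example by word frequency).

```python
def build_vocabulary(corpus):
    """Return the sorted list of distinct tokens in a tokenized corpus."""
    vocabulary = set()
    for document in corpus:
        vocabulary.update(document)
    return sorted(vocabulary)


# Toy corpus (made up): each document is a list of preprocessed tokens
toy_corpus = [
    ["broadband", "internet", "fast", "broadband"],
    ["share", "sale", "market", "share"],
    ["internet", "market", "growth"],
]

toy_vocabulary = build_vocabulary(toy_corpus)
num_documents = len(toy_corpus)                      # analogous to the corpus length
num_words = len(toy_vocabulary)                      # analogous to the words column
total_tokens = sum(len(doc) for doc in toy_corpus)   # repeated words counted each time

print(num_documents, num_words, total_tokens)
```

Repeated tokens such as 'broadband' add to the token count but only once to the vocabulary, which is why the words figure can be much smaller than the total length of the texts.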
For your second question, take a look at this introduction written by the author: https://towardsdatascience.com/a-beginners-guide-to-octis-optimizing-and-comparing-topic-models-is-simple-590554ec9ba6
The topic-document-matrix has the shape [number of topics x number of documents].
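As a rough illustration of the [topics x documents] layout, the sketch below uses a hand-written toy dictionary in place of a real model_output. The key name `topic-document-matrix` is taken from the question; the numbers and the helper names `matrix_shape` and `dominant_topics` are made up for illustration. One possible reason for a document count smaller than the dataset size is that the matrix covers only the training partition; that is an assumption worth checking against the OCTIS README.

```python
def matrix_shape(matrix):
    """Return (rows, columns) of a rectangular 2-D matrix."""
    return len(matrix), len(matrix[0])


def dominant_topics(topic_document_matrix):
    """For each document (column), return the index of its highest-weight topic."""
    num_topics, num_documents = matrix_shape(topic_document_matrix)
    return [
        max(range(num_topics), key=lambda t: topic_document_matrix[t][d])
        for d in range(num_documents)
    ]


# Toy stand-in for model_output (made-up values): 3 topics (rows) x 4 documents (columns)
toy_output = {
    "topic-document-matrix": [
        [0.7, 0.1, 0.2, 0.5],
        [0.2, 0.8, 0.3, 0.4],
        [0.1, 0.1, 0.5, 0.1],
    ],
}

shape = matrix_shape(toy_output["topic-document-matrix"])
best = dominant_topics(toy_output["topic-document-matrix"])
print(shape)  # (number of topics, number of documents)
print(best)   # one topic index per document
```

Each column holds one document's weights over the topics, so reading down a column tells you which topic that document belongs to most strongly. On a real model_output the matrix should be a NumPy array, in which case its `.shape` attribute gives the same pair directly.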