from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
# bag_of_words is the term-count matrix built above
tfidf_transformer.fit(bag_of_words)
# Get the feature names (the 10000 terms kept above);
# on newer scikit-learn releases use cv.get_feature_names_out() instead
feature_names = cv.get_feature_names()
# Extract the tf-idf vector for one abstract; the result is sparse (scipy.sparse.csr_matrix)
doc = corpus[534]  # pick an arbitrary abstract; this post only looks at one abstract's tf-idf values
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
# Convert from scipy.sparse.csr_matrix to scipy.sparse.coo_matrix
# (the .tocoo() method does this directly, no extra import needed)
coo = tf_idf_vector.tocoo()
# coo.col holds the column indices of the non-zero entries,
# coo.data holds the tf-idf values at those indices
tuples = zip(coo.col, coo.data)
sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
# Keep the 10 largest tf-idf values
sorted_items = sorted_items[:10]
score_vals = []
feature_vals = []
# idx: feature index, score: tf-idf value
for idx, score in sorted_items:
    score_vals.append(round(score, 3))
    feature_vals.append(feature_names[idx])
# Put the feature name and tf-idf value of each of the top 10 terms into the results dict
results = {}
for idx in range(len(feature_vals)):
    results[feature_vals[idx]] = score_vals[idx]
# Print the results
print('\nAbstract:')
print(doc)
print("\nkeywords:")
for k in results:
    print(k, results[k])
Could anyone show how to get a list keyed by the document index? Ideally something like:
0 offshoring 0.227 outsourcing offshoring decision 0.214 decision 0.208
1 geographically 0.172 outsourcing offshoring decision 0.214 decision 0.208
2 geographically 0.227 outsourcing offshoring decision 0.214 decision 0.208
3 offshoring 0.227 outsourcing offshoring decision 0.214 decision 0.208
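
A minimal sketch (not from the original post) of one way to produce that index-keyed list, assuming cv, tfidf_transformer, feature_names, and corpus are already defined as in the snippet above; the helper name top_keywords and the top_n parameter are illustrative:

def top_keywords(doc, top_n=3):
    # Return [(term, tfidf), ...] for the top_n highest-scoring terms of one document
    vec = tfidf_transformer.transform(cv.transform([doc])).tocoo()
    pairs = sorted(zip(vec.col, vec.data), key=lambda x: x[1], reverse=True)[:top_n]
    return [(feature_names[idx], round(score, 3)) for idx, score in pairs]

# Print "index term score term score ..." for every abstract in the corpus
for i, doc in enumerate(corpus):
    print(i, ' '.join(f'{term} {score}' for term, score in top_keywords(doc)))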