In Python, given a text file, how do I segment it into words, remove stopwords, keep only nouns and verbs, count word frequencies, and return the result as a dictionary sorted by frequency in descending order?
There are many word-segmentation packages, such as jieba and nltk.
import jieba.posseg as psg
from collections import Counter

with open('d:\\test.txt', 'r', encoding='utf-8') as f:
    data = f.read()

stopwords = set()  # put your stopword list here, e.g. loaded from a stopword file

# psg.cut() returns a one-shot generator, so keep the (word, flag) pairs in a list
pairs = [(s.word, s.flag) for s in psg.cut(data)
         if (s.flag.startswith('n') or s.flag.startswith('v'))
         and s.word not in stopwords]

dic = {word: flag for word, flag in pairs}
print('Dictionary before sorting:\n', dic)

freq = Counter(word for word, _ in pairs)  # word -> frequency
# sort by frequency, descending, as the question asks
dic = {w: dic[w] for w in sorted(dic, key=freq.get, reverse=True)}
print('Dictionary after sorting:\n', dic)
'''
Dictionary before sorting:
 {'汽车': 'n', '库': 'n', '词性': 'n', '进行': 'v', '区分': 'n', '动词': 'n', '比如': 'v', '名词': 'n', '房子': 'n', '植物': 'n', '提取': 'v', '来': 'v'}
Dictionary after sorting:
 {'动词': 'n', '汽车': 'n', '提取': 'v', '名词': 'n', '来': 'v', '植物': 'n', '房子': 'n', '比如': 'v', '区分': 'n', '进行': 'v', '词性': 'n', '库': 'n'}
'''
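The frequency-counting and descending-sort step itself does not depend on jieba; it can be sketched with the standard library alone (the token list below is made up to stand in for a segmenter's output):

```python
from collections import Counter

# Hypothetical token list standing in for jieba's segmentation output
words = ['汽车', '动词', '汽车', '名词', '动词', '汽车']

freq = Counter(words)            # word -> count
# most_common() already yields (word, count) pairs in descending order of count,
# so wrapping it in dict() gives a frequency dictionary sorted high-to-low
dic = dict(freq.most_common())
print(dic)  # {'汽车': 3, '动词': 2, '名词': 1}
```

Since Python 3.7, dict preserves insertion order, which is what makes a "sorted dictionary" like this possible.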
jieba can only do POS tagging for Chinese; for English POS tagging, use the nltk library.