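Counting word frequencies in an English paragraph
Given an English paragraph (or article), write a program that counts how many times each word occurs and prints the three most frequent words with their counts.
The paragraph is as follows: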
Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation.
Sample output
and :8
Python:5
to :3
def getText():
    # Read the whole file; the with-block closes it afterwards.
    with open("C:/Users/Lenovo/Desktop/hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()
    # Replace punctuation with spaces so it does not stick to words.
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")
    return txt

hamletText = getText()
words = hamletText.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort by count, highest first.
items.sort(key=lambda x: x[1], reverse=True)
# Print the three most frequent words (fewer if the text has fewer distinct words).
for word, count in items[:3]:
    print("{0:<10}{1:>5}".format(word, count))
Change the input and output file paths to your own, then put your own English test paragraph in the input file and try it.
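As a minimal sketch of the file-in, file-out variant mentioned above: the function name `top_words` and both path parameters are hypothetical and not part of any answer in this thread. It treats runs of ASCII letters as words and keeps case as-is, which is an assumption about what counts as a word.

```python
from collections import Counter
import re


def top_words(in_path, out_path, n=3):
    """Count words in in_path and write the n most common ones to out_path."""
    with open(in_path, "r", encoding="utf-8") as f:
        text = f.read()
    # Assumption: a word is a run of ASCII letters; case is preserved.
    words = re.findall(r"[A-Za-z]+", text)
    top = Counter(words).most_common(n)
    # Write one "word:count" line per entry.
    with open(out_path, "w", encoding="utf-8") as f:
        for word, count in top:
            f.write(f"{word}:{count}\n")
    return top
```

A call such as `top_words("input.txt", "output.txt")` would then read your test paragraph from `input.txt` and write the top three to `output.txt`; both file names are placeholders.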
from collections import Counter
import re
s = """Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation."""
result = re.split(r"[ ,:/.\n']", s)
result = [i for i in result if i]
result = Counter(result)
result = sorted(result.items(), key = lambda x: x[1],reverse = True)[:3]
for i,j in result:
    print(i, ":", j)
'''--result
and : 8
Python : 5
to : 3
'''
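import re
from collections import Counter


def get_file_words(path, num):
    """
    Count how often each word occurs in an English article and return the
    10 most frequent words with their counts. Also answer these questions:
    1) After creating a file object f, what is the difference between
       f.readlines() and f.xreadlines()?
    2) Additional requirement: text inside double quotes must count as a
       single word. How can that be implemented?
    :return: list of (word, count) tuples
    """
    list_words = []
    with open(path, "r") as obj_file:
        text = obj_file.read()
    # Split on double quotes: even-indexed pieces lie outside quotes and are
    # split into words; odd-indexed pieces are quoted and kept whole.
    list_text = text.split('"')
    for i in range(0, len(list_text), 2):
        # Drop the empty strings re.split() leaves at the edges.
        list_words += [w for w in re.split(r"[0-9\W]+", list_text[i]) if w]
        if i + 1 < len(list_text):
            list_words.append(list_text[i + 1])
    obj_count = Counter(list_words)
    result = obj_count.most_common(num)
    return result


if __name__ == '__main__':
    print(get_file_words("aa.txt", 10))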
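The first question in the docstring (readlines versus xreadlines) is not answered by the code. In Python 2, `readlines()` reads the whole file into a list, while `xreadlines()` yields lines lazily, one at a time. `xreadlines()` no longer exists in Python 3, where iterating over the file object itself gives the lazy behaviour. A small sketch, using `io.StringIO` as a stand-in for a real file so that it runs without any file on disk:

```python
import io

# io.StringIO stands in for a real file object so the sketch is self-contained.
sample = "first line\nsecond line\nthird line\n"

# readlines() loads every line into a list at once.
f = io.StringIO(sample)
all_lines = f.readlines()

# Iterating over the file object yields one line at a time (lazy),
# which is the role xreadlines() played in Python 2.
f = io.StringIO(sample)
line_count = 0
for line in f:
    line_count += 1
```

For a large article the lazy form avoids holding every line in memory at once.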
import re
txt = '''Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation.
'''
li = re.findall(r'[A-Za-z]+',txt)
dic = {}
for v in li:
    dic[v] = dic.get(v,0) + 1
res = sorted(dic.items(),key=lambda x: x[1],reverse=True)
for k,v in res[:3]:
    print(f'{k}:{v}')
If this helps, please click the [Accept this answer] button below my answer. Thanks!
dic={}
x=0  # number of distinct words seen so far
while True:
    s=input()
    # A line containing only '!!!!!' ends the input.
    if s=='!!!!!':
        break
    for ch in '!,?.:*':
        s=s.replace(ch,' ')
    s=s.lower()
    l=s.split()
    for i in l:
        if i in dic:
            dic[i]+=1
        else:
            dic[i]=1
            x+=1
li=list(dic.items())  # convert to a list
li.sort(key=lambda x:x[0])  # sort alphabetically
li.sort(key=lambda x:x[1],reverse=True)  # then by count, highest first
dic=dict(li[0:10])
print(x)
for i in dic.keys():
    print("%s=%d"%(i,dic[i]))
That's the code. If anything is still unclear, feel free to message me privately.
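The code above sorts twice: first alphabetically, then by count in descending order. This works because Python's `list.sort()` is stable, so words with equal counts keep their alphabetical order. As a sketch, the same ordering can be had in one pass with a tuple key. The `counts` dictionary below is made-up sample data, not output from the program above.

```python
# Made-up sample counts, for illustration only.
counts = {"to": 3, "and": 8, "python": 5, "in": 3}

# Two-pass version, relying on list.sort() being stable.
two_pass = list(counts.items())
two_pass.sort(key=lambda kv: kv[0])
two_pass.sort(key=lambda kv: kv[1], reverse=True)

# One-pass version: negated count first (descending), then the word (ascending).
one_pass = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Both lists come out in the same order; the one-pass form states the tie-breaking rule explicitly instead of relying on sort stability.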
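1. Read the file, then use lower() and replace() to convert all words to lowercase and replace special characters with spaces.
def gettext():
    # errors='ignore' skips bytes that cannot be decoded.
    with open("piao.txt","r",errors='ignore') as f:
        txt = f.read()
    txt = txt.lower()
    # Replace each special character with a space so adjacent words stay separate.
    for ch in '!"#$&()*+,-./:;<=>?@[\\]^_{|}·~‘’':
        txt = txt.replace(ch," ")
    return txt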
2. Count word frequencies in the processed text and store them in a dictionary.
txt = gettext()
words = txt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1
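3. Convert the counts to a list, sort by frequency from highest to lowest, and print the top ten.
items = list(counts.items())
items.sort(key=lambda x:x[1],reverse=True)
# Slicing avoids an IndexError when there are fewer than ten distinct words.
for word,count in items[:10]:
    print("{0:<10}{1:>5}".format(word,count))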
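The three steps above can also be sketched more compactly with the standard library. This is one possible variant, not the answer's own method: the helper name `top_n` and the sample sentence are hypothetical, and `string.punctuation` covers ASCII punctuation only, so non-ASCII marks such as `·` or curly quotes would need to be added separately.

```python
from collections import Counter
import string


def top_n(text, n=10):
    """Lowercase, replace ASCII punctuation with spaces, count, return the top n."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)


# Made-up sample text, for illustration only.
sample = "The cat saw the dog. The dog ran!"
for word, count in top_n(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter.most_common(n)` does the sorting and slicing of step 3 in a single call and simply returns fewer entries when the text has fewer than `n` distinct words.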