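Word-frequency analysis of the this.py file

Run a word-frequency analysis on the plaintext produced by this.py. The characters '!', ',', '-' and '.' must not be counted, and the words "the", "and" and "is" must be excluded. Sort the words by number of occurrences in descending order and print the top 10 words with their counts.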
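
The requirement refers to the plaintext that this.py produces. Assuming this means the standard-library `this` module (an assumption; the question does not say so explicitly), the module source stores its text ROT13-encoded in the attribute `this.s`, so counting words in the raw .py file would count ciphertext. A minimal sketch of getting the decoded text, where the helper name `zen_plaintext` is made up for illustration:

```python
import codecs
import contextlib
import io


def zen_plaintext():
    """Return the decoded text of the standard-library `this` module.

    Hypothetical helper; assumes the question's this.py is that module.
    """
    # Importing `this` prints the text as a side effect; capture and discard it.
    with contextlib.redirect_stdout(io.StringIO()):
        import this
    # `this.s` holds the ROT13-encoded text; decode it to the plaintext.
    return codecs.decode(this.s, "rot_13")


if __name__ == "__main__":
    text = zen_plaintext()
    print(text.splitlines()[0])
```

The returned string can be passed to any of the counting approaches in the answers, or written to a text file first and read back from there.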

Written with the help of New Bing.
You can test it yourself!

# Read the text to analyze
with open('this.py', 'r') as f:
    text = f.read()

# Characters that must not be counted (newlines are treated as separators too)
exclude_chars = ['!', ',', '-', '.', '\n']

# Replace each excluded character with a space
for ch in exclude_chars:
    text = text.replace(ch, ' ')

# Words to exclude
exclude_words = ['the', 'and', 'is']

# Split on whitespace
words = text.split()

# Count word occurrences
word_count = {}
for word in words:
    word = word.lower()
    # Skip the excluded words
    if word in exclude_words:
        continue
    # Record the occurrence
    word_count[word] = word_count.get(word, 0) + 1

# Sort by count in descending order
sorted_word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

# Print the 10 most frequent words and their counts
# (slicing avoids an IndexError when there are fewer than 10 distinct words)
for word, count in sorted_word_count[:10]:
    print(word, count)

You presumably have the .py file yourself, so it is not provided here for testing. Here is the code:

import re
from collections import Counter


with open('this.py', 'r') as f:
    text = f.read()

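# Replace the characters that must not be counted with spaces
# (the hyphen is escaped so it is not read as a character range)
text = re.sub(r'[!,\-.]', ' ', text)
# Remove the excluded words, regardless of case
text = re.sub(r'\b(the|and|is)\b', '', text, flags=re.IGNORECASE)

# Lower-case before counting so that "Now" and "now" count as the same word
words = re.findall(r'\b\w+\b', text.lower())
counter = Counter(words)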

for word, count in counter.most_common(10):
    print(word, count)
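
One regex detail worth checking in patterns like the one above: inside `[...]`, an unescaped hyphen between two characters denotes a range. In `[!,-.]` the range `,-.` happens to cover exactly `,`, `-` and `.` in ASCII, so it matches the intended characters, but a different ordering changes the meaning. A small sketch with a made-up sample string:

```python
import re

sample = "a,b-c.d+e!f"

# Unescaped: ",-." is the range from "," to ".", i.e. "," "-" "." in ASCII.
as_written = re.sub(r"[!,-.]", " ", sample)

# Escaped hyphen: states the intent explicitly, no range involved.
escaped = re.sub(r"[!,\-.]", " ", sample)

# Reordered: "!-." is now the range from "!" to ".", which also covers "+".
reordered = re.sub(r"[,!-.]", " ", sample)

print(as_written)  # a b c d+e f
print(escaped)     # a b c d+e f
print(reordered)   # a b c d e f
```

Escaping the hyphen (or placing it first or last in the class) keeps the pattern correct even if someone later reorders the characters.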

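This task requires first reading the plaintext generated by this.py, then running the word-frequency analysis, and finally selecting the 10 words that occur most often.

Reading the file and stripping punctuation can be done with Python's re and string modules, and the words can be counted with the Counter class from the collections module.

The code is as follows: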

import re
import string
from collections import Counter

with open("this.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Replace punctuation (including '!', ',', '-' and '.') with spaces
text = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', text)

# Count word frequencies, skipping the excluded words
counts = Counter(word for word in text.lower().split() if word not in ['the', 'and', 'is'])

# Print the 10 most frequent words
for word, count in counts.most_common(10):
    print(f"{word}: {count}")

The following answer is from ChatGPT:

I can solve this problem.

First, open the Python file and read its text content, which can be done with the following code:

with open('this.py', 'r', encoding='utf-8') as f:
    content = f.read()

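Next, split the text into words and count them, removing punctuation and the specified high-frequency words. Regular expressions and Python's collections module can do this. Here is a function that takes a string and returns each word with its number of occurrences: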

import re
from collections import Counter

def count_words(text):
    # Replace every character that is not a letter, digit or underscore with a space
    pattern = r'[^\w]'
    text = re.sub(pattern, ' ', text)
    # Lower-case the text and split it into words
    words = text.lower().split()
    # Drop the excluded high-frequency words
    exclude = {'the', 'and', 'is'}
    words = [word for word in words if word not in exclude]
    # Count how often each word occurs
    word_counts = Counter(words)
    return word_counts

Finally, to print the top 10 words and their counts in descending order of occurrences, use the following code:

word_counts = count_words(content)
top_words = word_counts.most_common(10)
for word, count in top_words:
    print('{}: {}'.format(word, count))
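
For reference, `Counter.most_common(n)` gives the same result as sorting the items by count in descending order and taking the first `n`, which is what the first answer does by hand. A small sketch on a made-up word list:

```python
from collections import Counter

# Made-up sample words, only for illustration.
words = "now better than never although never often better than right now".split()
counts = Counter(words)

# Built-in helper: the 3 most frequent (word, count) pairs.
top3 = counts.most_common(3)

# Manual equivalent: sort by count, descending, then slice.
by_sort = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]

print(top3)
print(top3 == by_sort)
```

Words with equal counts keep the order in which they were first seen, so which tied words make the top 10 depends on their order in the text.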

The complete code is shown below:

import re
from collections import Counter

def count_words(text):
    # Replace every character that is not a letter, digit or underscore with a space
    pattern = r'[^\w]'
    text = re.sub(pattern, ' ', text)
    # Lower-case the text and split it into words
    words = text.lower().split()
    # Drop the excluded high-frequency words
    exclude = {'the', 'and', 'is'}
    words = [word for word in words if word not in exclude]
    # Count how often each word occurs
    word_counts = Counter(words)
    return word_counts

with open('this.py', 'r', encoding='utf-8') as f:
    content = f.read()

word_counts = count_words(content)
top_words = word_counts.most_common(10)
for word, count in top_words:
    print('{}: {}'.format(word, count))

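Note that this code only counts word frequencies for this one Python file and is not general-purpose. To analyze other files or text, the code has to be modified accordingly. It also does not account for differences such as singular and plural forms or verb tenses, so the counts may be off in some special cases and should be adjusted to your specific needs.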
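
One way to make the approach less tied to a single file is to pass the text, the excluded words and the cutoff as parameters. The following is only a sketch under that assumption; the function name `top_words` and its parameters are made up for illustration, and it still does not merge singular and plural forms or tenses:

```python
import re
from collections import Counter


def top_words(text, stop_words=("the", "and", "is"), n=10):
    """Return the n most common words in text, ignoring stop_words.

    Hypothetical helper, not part of the original answers.
    """
    # \w+ keeps runs of letters, digits and underscores; all punctuation,
    # including apostrophes and hyphens, acts as a separator.
    words = re.findall(r"\w+", text.lower())
    excluded = set(stop_words)
    return Counter(w for w in words if w not in excluded).most_common(n)


sample = "Flat is better than nested. Sparse is better than dense."
print(top_words(sample, n=2))  # [('better', 2), ('than', 2)]
```

The text can come from any source, for example `open(path, encoding='utf-8').read()` for an arbitrary file.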
