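Saving the sentences that contain keywords to a new txt file (a sentence here means the text that starts after one period "." and ends at the next period ".", not the whole article), but the code raises an error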

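I have a piece of code that bolds keywords in bulk and also generates an unmatched_keywords file to store the keywords that were not matched. It should also save the sentences containing the keywords to a new txt file, where a sentence means the text between one period "." and the next period ".", not the whole article. However, it raises an error: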


import re

with open('keyword.txt', 'r') as file:
    keyword = [word.strip() for word in file.readlines()]

with open('file.txt', 'r') as file:
    article = file.read()

unmatched_keywords = []
for word in keyword:
    pattern = re.compile(fr'\b{re.escape(word)}\b')
    match = re.search(pattern, article)
    if match:
        article = re.sub(pattern, f"<b>{match.group()}</b>", article, count=1)
    else:
        unmatched_keywords.append(word)


with open("new.txt", "w") as file:
    file.write(article)


with open('unmatched_keywords.txt', 'w') as file:
    for word in unmatched_keywords:
        file.write(word + '\n')

# Use a regular expression to find every sentence that contains a keyword
pattern = r'[^.]*' + str(keyword) + r'[^.]*'
sentences = re.findall(pattern,article)
 
# Write the results
with open("sentences.txt", "w") as file:
    file.write(sentences)
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\notepad5.txt", line 29, in <module>
    sentences = re.findall(pattern,article)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\re.py", line 241, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 598, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range l-b at position 1923

How should I fix this?

  • I ran your code, and it only failed on the last line, with an error about writing a non-string value.

#!/usr/bin/env python
# coding: utf-8
import re


mypath = '/sdcard/001/test/'

with open(f'{mypath}keyword.txt', 'r') as file:
    keyword = [word.strip() for word in file.readlines()]

with open(f'{mypath}file.txt', 'r') as file:
    article = file.read()

unmatched_keywords = []

for word in keyword:
    pattern = re.compile(fr'\b{re.escape(word)}\b')
    match = re.search(pattern, article)

    if match:
        article = re.sub(pattern, f"<b>{match.group()}</b>", article, count=1)
    else:
        unmatched_keywords.append(word)

with open(f"{mypath}new.txt", "w") as file:
    file.write(article)

with open(f'{mypath}unmatched_keywords.txt', 'w') as file:
    for word in unmatched_keywords:
        file.write(word + '\n')

# Use a regular expression to find every sentence that contains a keyword
pattern = r'[^.]*' + str(keyword) + r'[^.]*'
sentences = re.findall(pattern,article)
# Write the results

with open(f"{mypath}sentences.txt", "w") as file:
    #file.write(sentences)
    file.write('\n'.join(sentences))
  • The variable sentences is a list of strings; joining it into one string before writing fixes that.

  • The files are created normally.

The contents of the generated file don't seem to match what you expected, though; I didn't check closely.
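
A likely cause of both the original re.error and the unexpected file contents is `str(keyword)`: it inserts the printed form of the whole list (brackets, quotes and commas included) into the pattern, and the regex engine reads `[...]` as a character class. The sketch below reproduces the error with made-up keywords and shows one possible way to build the pattern instead. The sample data and the names `alternation` and `sentence_pattern` are assumptions for illustration, not part of the original code.

```python
import re

# Hypothetical sample data, not taken from the original files.
keywords = ["deep learning", "label-based", "regex"]
text = "Intro text. We study deep learning here. Nothing relevant. A regex helps."

# str() on a list returns its repr: "['deep learning', 'label-based', 'regex']".
# Inside a pattern, [...] is a character class, so "l-b" is read as a range.
broken = r'[^.]*' + str(keywords) + r'[^.]*'
try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# Escape every keyword and join them with "|" to build a real alternation.
alternation = "|".join(re.escape(k) for k in keywords)
sentence_pattern = re.compile(rf'[^.]*(?:{alternation})[^.]*\.?')

# Each match is one period-delimited sentence that contains a keyword.
sentences = [s.strip() for s in sentence_pattern.findall(text)]
print(sentences)
```

If keyword.txt contains blank lines, they should be filtered out first, because an empty alternative would match every sentence.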


If this problem is still unresolved, the following may help:
  • The following answer comes from ChatGPT:

    To solve this, you can make the following changes:

    1. First, read the original text file and store its contents in a string variable.
    with open('original_text.txt', 'r') as file:
        text = file.read()
    
    2. Next, extract the keywords and the sentences from the text, using regular expressions for the matching.
    import re
    
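    # Find the keywords (replace this list with your own keywords)
    keywords = ['keyword1', 'keyword2', 'keyword3']
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, keywords)))
    matched_keywords = re.findall(pattern, text, flags=re.IGNORECASE)
    
    # Split the text into period-delimited sentences (the first one included),
    # then keep only the sentences that contain a keyword
    all_sentences = re.findall(r'[^.\n]+\.', text)
    sentences = [s.strip() for s in all_sentences
                 if re.search(pattern, s, flags=re.IGNORECASE)]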
    
    3. Then save the matched sentences to a new txt file.
    with open('matched_sentences.txt', 'w') as file:
        for sentence in sentences:
            file.write(sentence + '\n')
    
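    4. Finally, generate a file of unmatched keywords: go through all the keywords and write the ones that were not matched to a file.
    # The search ignored case, so compare in lowercase as well
    matched_lower = {m.lower() for m in matched_keywords}
    unmatched_keywords = [k for k in keywords if k.lower() not in matched_lower]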
    
    with open('unmatched_keywords.txt', 'w') as file:
        for keyword in unmatched_keywords:
            file.write(keyword + '\n')
    

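    These steps cover what you asked for: they save the sentences that contain keywords to a new txt file and generate a file of unmatched keywords. If you have further questions, feel free to ask.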
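
For reference, the three tasks from the question (bold the first occurrence of each keyword, collect the unmatched keywords, collect the sentences that contain a keyword) could be combined roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name `process` and the sample text are made up, file reading and writing are left out, and it assumes keywords made of ordinary word characters, since `\b` does not behave as expected around keywords that start or end with punctuation.

```python
import re

def process(article, keywords):
    """Return (bolded_article, unmatched_keywords, keyword_sentences)."""
    bolded = article
    matched, unmatched = [], []
    for word in keywords:
        pattern = re.compile(rf'\b{re.escape(word)}\b')
        if pattern.search(bolded):
            # Wrap only the first occurrence in <b>...</b>; a function
            # replacement keeps backslashes in the match from being
            # interpreted as escapes.
            bolded = pattern.sub(lambda m: f"<b>{m.group()}</b>", bolded, count=1)
            matched.append(word)
        else:
            unmatched.append(word)

    # Split the untouched article on periods, then keep the sentences
    # that contain at least one matched keyword.
    sentences = [s.strip() for s in re.findall(r'[^.]+\.?', article)]
    hits = [s for s in sentences
            if any(re.search(rf'\b{re.escape(w)}\b', s) for w in matched)]
    return bolded, unmatched, hits

# Hypothetical sample input.
sample = "Python is popular. Many people use regex daily. The end."
bolded, unmatched, hits = process(sample, ["Python", "regex", "Rust"])
print(bolded)
print(unmatched)
print(hits)
```

Each result can then be written out as in the code above, using `'\n'.join(hits)` for the sentence list, because `file.write` accepts only a string.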

