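I have a piece of code that bolds keywords in bulk, writes the keywords that were not matched to an unmatched_keywords file, and also saves the sentences containing the keywords to a new txt file. By "sentence" I mean the text that starts after one period "." and ends at the next period ".", not the whole article. However, the code raises an error: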
import re
with open('keyword.txt', 'r') as file:
    keyword = [word.strip() for word in file.readlines()]
with open('file.txt', 'r') as file:
    article = file.read()
unmatched_keywords = []
for word in keyword:
    pattern = re.compile(fr'\b{re.escape(word)}\b')
    match = re.search(pattern, article)
    if match:
        article = re.sub(pattern, f"<b>{match.group()}<b>", article, count=1)
    else:
        unmatched_keywords.append(word)
with open("new.txt", "w") as file:
    file.write(article)
with open('unmatched_keywords.txt', 'w') as file:
    for word in unmatched_keywords:
        file.write(word + '\n')
# Use a regular expression to find all sentences that contain a keyword
pattern = r'[^.]*' + str(keyword) + r'[^.]*'
sentences = re.findall(pattern,article)
# Write the result
with open("sentences.txt", "w") as file:
    file.write(sentences)
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\notepad5.txt", line 29, in <module>
    sentences = re.findall(pattern,article)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\re.py", line 241, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python38\lib\sre_parse.py", line 598, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range l-b at position 1923
How should I fix this?
#!/usr/bin/env python
# coding: utf-8
import re
mypath = '/sdcard/001/test/'
with open(f'{mypath}keyword.txt', 'r') as file:
    keyword = [word.strip() for word in file.readlines()]
with open(f'{mypath}file.txt', 'r') as file:
    article = file.read()
unmatched_keywords = []
for word in keyword:
    pattern = re.compile(fr'\b{re.escape(word)}\b')
    match = re.search(pattern, article)
    if match:
        # Bold only the first occurrence; note the closing </b> tag
        article = re.sub(pattern, f"<b>{match.group()}</b>", article, count=1)
    else:
        unmatched_keywords.append(word)
with open(f"{mypath}new.txt", "w") as file:
    file.write(article)
with open(f'{mypath}unmatched_keywords.txt', 'w') as file:
    for word in unmatched_keywords:
        file.write(word + '\n')
# Use a regular expression to find all sentences that contain a keyword
# NOTE: str(keyword) inserts the repr of the whole list, not the keywords themselves
pattern = r'[^.]*' + str(keyword) + r'[^.]*'
sentences = re.findall(pattern,article)
# Write the result
with open(f"{mypath}sentences.txt", "w") as file:
    #file.write(sentences)
    file.write('\n'.join(sentences))
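The variable sentences is a list of strings; joining it into one string before writing fixes that part.
The files are created without problems.
The contents of the generated file don't seem to match what was expected, though; I haven't checked them closely.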
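The traceback and the unexpected output are both consistent with `str(keyword)` in the pattern: it produces the repr of the whole list (brackets, quotes and commas included), which the regex engine reads as a single character class. A keyword such as "well-being" then yields the invalid range `l-b`. Below is a minimal sketch of the failure and of one possible replacement. The keyword list, the sample text and the helper `build_sentence_pattern` are assumptions for illustration and do not come from the thread.

```python
import re

keyword = ["alpha", "well-being"]  # hypothetical keywords for illustration

# str(keyword) is the repr of the whole list, brackets and quotes included,
# so the regex engine reads it as one character class.
broken = r'[^.]*' + str(keyword) + r'[^.]*'
try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    error_message = str(exc)  # the "l-b" range comes from "well-being"


def build_sentence_pattern(keywords):
    """Match a period-delimited sentence containing any keyword (hypothetical helper)."""
    # Escape each keyword and join them with "|" so they are matched literally;
    # empty strings (blank lines in keyword.txt) are skipped.
    alternation = "|".join(re.escape(k) for k in keywords if k)
    return re.compile(rf"[^.]*\b(?:{alternation})\b[^.]*\.")


sample = "First one. Her well-being matters. Nothing here. Alpha is not alpha."
sentences = [s.strip() for s in build_sentence_pattern(keyword).findall(sample)]
print(error_message)
print(sentences)
```

This sketch assumes that sentences end only at a period and that no keyword itself contains a period; abbreviations or decimals such as "3.5" would need a more careful sentence splitter.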
To solve this problem, you can modify the code as follows:
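import re

with open('original_text.txt', 'r') as file:
    text = file.read()

# Keywords to look for
keywords = ['keyword1', 'keyword2', 'keyword3']  # replace with your own keyword list
# Escape each keyword so regex metacharacters are matched literally
pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, keywords))), re.IGNORECASE)
# Lowercase the matches so they compare correctly against the keyword list
matched_keywords = {m.lower() for m in pattern.findall(text)}

# Split the text into period-terminated sentences and keep those containing a keyword
sentences = [s.strip() for s in re.findall(r'[^.]+\.', text) if pattern.search(s)]
with open('matched_sentences.txt', 'w') as file:
    for sentence in sentences:
        file.write(sentence + '\n')

# Keywords that never appeared in the text
unmatched_keywords = [k for k in keywords if k.lower() not in matched_keywords]
with open('unmatched_keywords.txt', 'w') as file:
    for keyword in unmatched_keywords:
        file.write(keyword + '\n')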
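The steps above save the sentences that contain keywords to a new txt file and write the unmatched keywords to a separate file. If you have any other questions, feel free to ask.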
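For anyone who wants to try the three outputs from the original question (bolded article, unmatched keywords, sentences) without touching files, here is a rough in-memory sketch. The function name `process`, the sample article and the keyword list are hypothetical. It also assumes the article contains no `<b>` tags beforehand and that no keyword is the single letter "b", since the bold marker is used to spot the sentences that received a replacement.

```python
import re


def process(article, keywords):
    """Return (bolded_article, unmatched_keywords, sentences). Hypothetical helper."""
    unmatched = []
    for word in keywords:
        pattern = re.compile(rf"\b{re.escape(word)}\b")
        # subn returns the new text and the number of replacements (0 or 1 here);
        # the lambda keeps backslashes in the matched text from being
        # interpreted as replacement escapes.
        article, count = pattern.subn(
            lambda m: f"<b>{m.group()}</b>", article, count=1
        )
        if count == 0:
            unmatched.append(word)
    # A sentence is the text up to and including the next period;
    # keep only the sentences in which a keyword was bolded.
    sentences = [
        s.strip() for s in re.findall(r"[^.]+\.", article) if "<b>" in s
    ]
    return article, unmatched, sentences


bolded, unmatched, sentences = process(
    "Cats purr. Dogs bark loudly. Birds sing.",
    ["Dogs", "fish", "sing"],
)
print(bolded)
print(unmatched)
print(sentences)
```

Because only the first occurrence of each keyword is bolded, only the sentence holding that first occurrence is collected; matching every sentence that mentions a keyword would need a search over each sentence instead of the `<b>` check.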