Task: use jieba to POS-tag the text in a file, then write all the POS tags to a txt file.
Many errors occurred (listed at the end).
#coding:utf8
import re,os,jieba.posseg

def para2sentences_ccl(para):  # split the text into sentences at sentence-final punctuation
    marks = re.compile(r'。|?|……|?”|。”|!”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):  # segment the sentences with jieba and create a new file containing only the POS tags
    with open(file_path,'r',encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()  # read the material in line by line
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None]  # split the sentences in raw_lines at sentence-final punctuation via para2sentences_ccl
    seg_sentences = [jieba.posseg.lcut(sentence) for sentence in cleaned_paras]  # segment the sentences above into word/flag pairs with jieba
    flag_sentence = seg_sentences[1]  # take the flag (POS tag) part of the pairs segmented above
    names = os.path.split(file_path)  # split the input file path into directory and file name
    txt_name = 'out'+names[1]  # prefix the file name with 'out'
    txt_path = file_path.join(names[0],txt_name)  # join the directory and txt_name into the new txt file path
    with open(txt_path,'r',-1,encoding='utf8') as txt:  # open this new file
        for flag in flag_sentence:  # loop over the flags in flag_sentence
            txt.writelines(flag + '\n')  # separate each flag with a newline
    return flag_sentence

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')
Here are the errors that occurred:
Traceback (most recent call last):
File "C:/Users/DELL/PycharmProjects/111/homework/第五次作业.py", line 31, in <module>
file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')
File "C:/Users/DELL/PycharmProjects/111/homework/第五次作业.py", line 20, in file2sentences_CCl_pos
seg_sentences = [jieba.posseg.lcut(sentence)for sentence in cleaned_paras]
File "C:/Users/DELL/PycharmProjects/111/homework/第五次作业.py", line 20, in <listcomp>
seg_sentences = [jieba.posseg.lcut(sentence)for sentence in cleaned_paras]
File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 310, in lcut
return list(cut(sentence, HMM))
File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 294, in cut
for w in dt.cut(sentence, HMM=HMM):
File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 249, in cut
for w in self.__cut_internal(sentence, HMM=HMM):
File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 217, in __cut_internal
sentence = strdecode(sentence)
File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\_compat.py", line 79, in strdecode
sentence = sentence.decode('utf-8')
AttributeError: 'list' object has no attribute 'decode'
###### Why do these errors occur, and how should I fix them?
This answer was put together from my own thinking combined with searches across various sources; the details are as follows:
The error occurs because a list of already-split sentences was passed to jieba.posseg.lcut(), and that is what triggers the failure: jieba.posseg.lcut() expects a single sentence as a string argument, not a list.
There are also a few other small mistakes in the code, such as opening the output file with the wrong arguments and writing to it with the wrong function.
Here is the corrected code, which POS-tags the file's text and writes the POS tags to a txt file:
import re
import os
import jieba.posseg as pseg
def para2sentences_ccl(para):
    """Split the text into sentences at sentence-final punctuation."""
    marks = re.compile(r'。|?|!|!|……|…|…”|。”|?”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2], sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences
def file2sentences_CCl_pos(file_path):
    """POS-tag the file and write the POS tags to a txt file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    sentences = para2sentences_ccl(text)
    flag_sentences = [pseg.lcut(sentence) for sentence in sentences]
    names = os.path.split(file_path)
    txt_name = 'out_' + names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', encoding='utf-8') as f:
        for flag_sentence in flag_sentences:
            for word, flag in flag_sentence:
                f.write(word + '/' + flag + ' ')
            f.write('\n')
    return flag_sentences

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')
In the revised code, file2sentences_CCl_pos() first reads the file's contents and splits the text into sentences at sentence-final punctuation. It then POS-tags each sentence and writes the tags to a txt file. When writing, each word and its POS tag are joined with '/', and sentences are separated by newlines.
Note that this code uses utf-8, so the input file must be utf-8 encoded. If it is not, change the encoding argument of open() to the appropriate encoding.
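The encoding caveat can be checked in isolation. A minimal sketch (sample string invented) showing why the right encoding matters when opening the file:

```python
text = '从普通女孩到银行家'

gbk_bytes = text.encode('gb18030')  # how a gb18030-saved file stores it
utf8_bytes = text.encode('utf-8')   # how a utf-8-saved file stores it

# The on-disk byte sequences differ (2 bytes vs 3 bytes per CJK character
# here), so open() must be given the encoding the file was actually saved in.
print(len(gbk_bytes), len(utf8_bytes))  # 18 27
assert gbk_bytes.decode('gb18030') == utf8_bytes.decode('utf-8') == text
```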
If this answer helped you, please click to accept it ~ thanks.
Answer quoting new Bing:
In file2sentences_CCl_pos, the variable cleaned_paras is a list of lists of sentences, and in seg_sentences you try to pass each element to jieba.posseg.lcut. But jieba.posseg.lcut expects a single sentence string as its argument, not a list of sentences, so you need to join them into single strings before calling lcut.
Also, when opening the output file, the second argument should be 'w' rather than 'r' so the file is opened in write mode. Likewise, the path should be built with os.path.join rather than string concatenation.
The revised code looks like this:
import os
import re
import jieba.posseg

def para2sentences_ccl(para):
    # Split the text into sentences at sentence-final punctuation.
    marks = re.compile(r'。|?|……|?”|。”|!”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2], sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):
    with open(file_path, 'r', encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None]
    seg_sentences = [''.join(sentence) for sentence_list in cleaned_paras for sentence in sentence_list]
    # Flatten the nested lists into single sentence strings.
    flag_sentence = []
    for sentence in seg_sentences:
        flags = [flag for word, flag in jieba.posseg.lcut(sentence)]
        flag_sentence.extend(flags)
    # Collect the POS tags of all sentences, then join them for writing to the file.
    names = os.path.split(file_path)
    txt_name = 'out' + names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', encoding='utf8') as txt:
        txt.writelines('\n'.join(flag_sentence))
    return flag_sentence

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')
With this revised code, the POS tags should be written out to the file.
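The path handling the answers above recommend (split off the file name, prefix it with 'out', rejoin with os.path.join) can be sketched standalone; the path below is made up:

```python
import os

file_path = os.path.join('data', 'novel.txt')  # hypothetical input path
head, tail = os.path.split(file_path)          # -> ('data', 'novel.txt')
txt_path = os.path.join(head, 'out' + tail)    # separator-safe, unlike str concat
print(txt_path)
```

os.path.join inserts the correct separator for the platform, which is why it is preferred over concatenating strings (and over the misuse of str.join in the original code).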
What exactly is your question? I have done Chinese word-segmentation experiments as well, so perhaps I can help.
Answer quoting ChatGPT. From the code and the error message, the following problems stand out:
1. cleaned_paras is a two-dimensional list: each sublist contains several sentences, while jieba.posseg.lcut needs an individual sentence as input. So the list comprehension building seg_sentences needs an extra level of looping to pass each sentence of each sublist to jieba.posseg.lcut separately.
2. When opening the output file, the mode should be 'w', not 'r'.
3. When writing to the file, the flag object must be converted to a string, because it is actually a jieba.posseg.pair object rather than a string.
4. When building the file path, use the os.path.join function rather than the str.join method.
Here is the revised code:
import re
import os
import jieba.posseg

def para2sentences_ccl(para):
    # Split the paragraph into sentences at sentence-final punctuation.
    marks = re.compile(r'。|?|……|?”|。”|!”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):
    # Read the file's contents.
    with open(file_path,'r',encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    # Split the file's paragraphs into sentences.
    sentences = []
    for para in raw_lines:
        cleaned_paras = para2sentences_ccl(para)
        sentences.extend(cleaned_paras)
    # Segment and POS-tag the sentences, writing the POS tags to a file.
    names = os.path.split(file_path)
    txt_name = 'out'+names[1]
    txt_path = os.path.join(names[0],txt_name)
    with open(txt_path,'w',encoding='utf8') as txt:
        for sentence in sentences:
            flag_sentence = jieba.posseg.lcut(sentence)
            for flag in flag_sentence:
                txt.write(str(flag.flag) + '\n')

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')
Note that jieba.posseg.lcut returns a list of jieba.posseg.pair objects, so inside the loop you can get the POS tag via flag.flag. When writing to the file, flag.flag also needs to be converted to a string.
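Since jieba may not be installed everywhere, the pair interface can be mimicked with a namedtuple; this is a stand-in sketch (the tokens below are invented, not real jieba output):

```python
from collections import namedtuple

# Stand-in for jieba.posseg's pair, which exposes .word and .flag attributes.
Pair = namedtuple('Pair', ['word', 'flag'])

flag_sentence = [Pair('银行家', 'n'), Pair('的', 'uj')]  # invented tokens
for flag in flag_sentence:
    print(flag.flag)  # pull out just the POS tag, as in the loop above
```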
Judging from the error you posted, there is just one error, not several. Specifically, it is on line 17: in jieba.posseg.lcut(sentence), the sentence argument cannot be a list, only a string, hence the error, so you need to change that code.
para2sentences_ccl returns a list after splitting, and you collect those return values with another list comprehension, so you end up with a doubly nested list. Each time your for loop runs, sentence is of type list, and lists do not have the decode method that only strings have.
The fix is to add another level of for loop:
import re
import os
import jieba.posseg

def para2sentences_ccl(para):
    # Split the text into sentences at sentence-final punctuation.
    marks = re.compile(r'。|?|……|?”|。”|!”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):
    # Segment the sentences with jieba and create a new file containing only the POS tags.
    with open(file_path,'r', encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()  # read the material in line by line
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None]  # split the sentences in raw_lines at sentence-final punctuation via para2sentences_ccl
    seg_sentences = [list(jieba.posseg.lcut(''.join(sentence))) for sentence in cleaned_paras]  # join each sentence list into a string, then segment it into pairs with jieba
    flag_sentence = seg_sentences[1]  # take the pairs of one segmented sentence
    names = os.path.split(file_path)  # split the input file path into directory and file name
    txt_name = 'out' + names[1]  # prefix the file name with 'out'
    txt_path = os.path.join(names[0],txt_name)  # join directory and txt_name into the new txt file path
    with open(txt_path,'w',encoding='utf8') as txt:  # open this new file
        for flag in flag_sentence:  # loop over the pairs in flag_sentence
            txt.write(str(flag) + '\n')  # separate each one with a newline
    return flag_sentence

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')
The line seg_sentences = [jieba.posseg.lcut(sentence) for sentence in cleaned_paras] is meant to produce a structure like:
[[(word1, flag1), (word2, flag2), ...], [(word1, flag1), (word2, flag2), ...], ...]
def file2sentences_CCl_pos(file_path):
    with open(file_path, 'r', encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None]
    seg_sentences = []
    for sentence in cleaned_paras:
        # Each element of cleaned_paras is itself a list of sentence strings,
        # so join it back into one string before segmenting.
        seg_sentence = ''.join(sentence)
        seg_sentences.append(jieba.posseg.lcut(seg_sentence))
    flag_sentence = [token.flag for sentence in seg_sentences for token in sentence]
    names = os.path.split(file_path)
    txt_name = 'out' + names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', -1, encoding='utf8') as txt:
        for flag in flag_sentence:
            txt.write(flag + '\n')
    return flag_sentence
The following partially references the ChatGPT model:
First, the error message AttributeError: 'list' object has no attribute 'decode' shows that when strdecode() was called, the argument passed in was a list rather than a string, so we need to inspect the code to find where that happens.
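The AttributeError itself is easy to reproduce without jieba; a minimal sketch:

```python
# bytes objects have a .decode() method; lists do not, which is exactly
# what jieba's strdecode() tripped over when it was handed a sentence list.
assert hasattr(b'\xe4\xbd\xa0\xe5\xa5\xbd', 'decode')  # bytes: has decode
assert not hasattr(['你好'], 'decode')                  # list: no decode

try:
    ['你好'].decode('utf-8')  # same mistake as passing a list into lcut
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'decode'
```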
In file2sentences_CCl_pos(), line 17 splits a passage of text into multiple sentences, each of which is a string, so the elements of the cleaned_paras list should be of string type, not list type. Modify para2sentences_ccl() as follows:
def para2sentences_ccl(para):
    marks = re.compile(r'。|?|……|?”|。”|!”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences
Also, on line 19 the elements of the seg_sentences list are jieba.posseg.pair objects, not strings, so they need to be converted to strings before further use. Modify file2sentences_CCl_pos() as follows:
def file2sentences_CCl_pos(file_path):
    with open(file_path,'r',encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None]
    # cleaned_paras is a list of sentence lists, so loop over both levels.
    seg_sentences = [[str(pair.flag) for pair in jieba.posseg.lcut(sentence)]
                     for para in cleaned_paras for sentence in para]
    names = os.path.split(file_path)
    txt_name = 'out'+names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', encoding='utf8') as txt:
        for flag_sentence in seg_sentences:
            txt.write('\n'.join(flag_sentence) + '\n')
    return seg_sentences
In this function, each sentence is first segmented into word and POS-tag parts, the POS tags are converted to strings, and they are then written to the file. Note that when opening the file, the mode must be set to 'w' for writing.
If my suggestion helps you, please click to accept. Have a nice day!
def getText(filepath):
    f = open(filepath, 'r', encoding='utf-8')
    text = f.read()
    f.close()
    return text  # return the file's text content