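Question about POS tagging with jieba.posseg.lcut in Python

Task: use jieba to POS-tag the text in a file, then write all of the POS tags to a txt file.
A lot of errors came up (shown at the end).

Below are my source code and the reasoning behind it.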

#coding:utf8
import re,os,jieba.posseg
def para2sentences_ccl(para): # splits the text into sentences at sentence-final punctuation
    marks = re.compile(r'。|？|……|？”|。”|！”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):# tags the sentences with jieba and creates a new file that contains only the POS tags
    with open(file_path,'r',encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines() # read the material in line by line
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None] # split the sentences in raw_lines at sentence-final punctuation with para2sentences_ccl
    seg_sentences = [jieba.posseg.lcut(sentence)for sentence in cleaned_paras]# tag the sentences from the previous line with jieba, producing pairs
    flag_sentence = seg_sentences[1]# pick out the flag (POS tag) part of the pairs produced above
    names = os.path.split(file_path)# split the input file path into directory and file name
    txt_name = 'out'+names[1]# prefix the file name with "out"
    txt_path = file_path.join(names[0],txt_name)# combine the directory and txt_name into the path of the new txt file
    with open(txt_path,'r',-1,encoding='utf8')as txt:# open the newly created file
        for flag in flag_sentence:# loop over the flags in flag_sentence
            txt.writelines(flag + '\n')# put each flag on its own line
    return flag_sentence

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')

Here is the error output:

Traceback (most recent call last):
  File "C:/Users/DELL/PycharmProjects/111/homework/第五次作业.py", line 31, in <module>
    file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')
  File "C:/Users/DELL/PycharmProjects/111/homework/第五次作业.py", line 20, in file2sentences_CCl_pos
    seg_sentences = [jieba.posseg.lcut(sentence)for sentence in cleaned_paras]
  File "C:/Users/DELL/PycharmProjects/111/homework/第五次作业.py", line 20, in <listcomp>
    seg_sentences = [jieba.posseg.lcut(sentence)for sentence in cleaned_paras]
  File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 310, in lcut
    return list(cut(sentence, HMM))
  File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 294, in cut
    for w in dt.cut(sentence, HMM=HMM):
  File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 249, in cut
    for w in self.__cut_internal(sentence, HMM=HMM):
  File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\posseg\__init__.py", line 217, in __cut_internal
    sentence = strdecode(sentence)
  File "C:\ProgramData\Anaconda3\lib\site-packages\jieba\_compat.py", line 79, in strdecode
    sentence = sentence.decode('utf-8')
AttributeError: 'list' object has no attribute 'decode'

Why do these errors occur, and how should I change the code?

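This answer combines my own reasoning with material gathered from several sources:

The error occurs because each element of cleaned_paras is a list of sentences (the return value of para2sentences_ccl() for one line), and that list is passed straight to jieba.posseg.lcut(). jieba.posseg.lcut() expects a single sentence as a string, not a list.

The code also has a few smaller mistakes: for example, the output file is opened in read mode ('r') rather than write mode ('w'), and writelines() is called where write() is meant.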

Here is the modified code, which POS-tags the file and writes the tags to a txt file:

import os
import re
import jieba.posseg as pseg

def para2sentences_ccl(para):
    """Split text into sentences at sentence-final punctuation."""
    marks = re.compile(r'。|？|！|!|……|…|…”|。”|？”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2], sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):
    """POS-tag the file and write each word with its tag to a txt file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    sentences = para2sentences_ccl(text)
    flag_sentences = [pseg.lcut(sentence) for sentence in sentences]
    names = os.path.split(file_path)
    txt_name = 'out_' + names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', encoding='utf-8') as f:
        for flag_sentence in flag_sentences:
            for word, flag in flag_sentence:
                f.write(word + '/' + flag + ' ')
            f.write('\n')
    return flag_sentences

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')

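In the modified code, file2sentences_CCl_pos() first reads the file contents and splits the text into sentences at sentence-final punctuation. It then POS-tags each sentence and writes the result to a txt file. Each word is joined to its POS tag with "/", and each sentence goes on its own line.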
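
One caveat about the splitting step, offered as an observation on the helper rather than something raised in the question: `re.split` only returns the delimiters when the pattern contains a capture group. The pattern as written has none, so the `[0::2]`/`[1::2]` pairing glues neighbouring sentences together instead of re-attaching each sentence to its end mark. A minimal sketch of the difference, using a shortened two-mark pattern and a made-up sample string (both are assumptions for illustration):

```python
import re

# Made-up sample text; the two-mark patterns are shortened for illustration.
text = "今天下雨。你去吗？我不去。"

no_group = re.compile(r"。|？")      # split() discards the delimiters
with_group = re.compile(r"(。|？)")  # split() keeps the delimiters as items


def pair_up(parts):
    """Join each even-indexed item with the odd-indexed item after it."""
    return ["".join(p).strip() for p in zip(parts[0::2], parts[1::2])]


merged = pair_up(no_group.split(text))
kept = pair_up(with_group.split(text))

print(merged)  # neighbouring sentences glued together, punctuation lost
print(kept)    # each sentence followed by its own end mark
```

If a capture group is added, longer alternatives such as `。”` would also need to come before `。` in the pattern, because alternation tries its branches from left to right.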

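Note that this code reads the input as utf-8, so the input file must be utf-8 encoded. If it is not, change the encoding argument of open() to the file's actual encoding (the question's code uses gb18030).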
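
If the file's encoding is not known in advance, one option is to try a short list of candidates in order. The sketch below is only an illustration: `read_text` and its candidate list are hypothetical names, not part of jieba or of the code above, and the order of the candidates is an assumption (strict utf-8 first, then gb18030 as used in the question).

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings could decode " + path)


# Demo on a temporary file written as gb18030 (the sample text is made up).
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.txt")
    with open(sample, "w", encoding="gb18030") as f:
        f.write("今天下雨。")
    text, used = read_text(sample)

print(used, text)
```

Unlike `errors='ignore'`, which silently drops bytes that cannot be decoded, this approach raises an exception when none of the candidates fits.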


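This answer draws on new Bing:
In file2sentences_CCl_pos, cleaned_paras is a list of sentence lists, and the comprehension that builds seg_sentences passes each inner list to jieba.posseg.lcut. jieba.posseg.lcut expects a single sentence string, not a list of sentences, so the inner lists have to be flattened into individual sentence strings before lcut is called.

In addition, the second argument of open() for the output file should be 'w' rather than 'r', so that the file is opened for writing. Likewise, the output path should be built with os.path.join rather than with the string method file_path.join.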
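
To make the two fixes concrete: `str.join` takes a single iterable and uses the string it is called on as the separator, so `file_path.join(names[0], txt_name)` raises a `TypeError`, whereas `os.path.join` builds a path from its arguments. The sketch below uses a temporary directory and hypothetical helper names (`output_path_for`, `write_tags`) so that it runs without the original file or jieba.

```python
import os
import tempfile


def output_path_for(file_path, prefix="out"):
    """Return a path in the same folder with `prefix` added to the file name."""
    folder, name = os.path.split(file_path)
    return os.path.join(folder, prefix + name)


def write_tags(path, tags):
    """Write one tag per line; mode 'w' creates the file or overwrites it."""
    with open(path, "w", encoding="utf8") as txt:
        for tag in tags:
            txt.write(tag + "\n")


with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, "sample.txt")
    target = output_path_for(source)

    # Mode 'r' cannot create a file that does not exist yet.
    try:
        open(target, "r", encoding="utf8")
        read_mode_failed = False
    except FileNotFoundError:
        read_mode_failed = True

    write_tags(target, ["n", "v", "x"])  # made-up tags for the demo
    with open(target, "r", encoding="utf8") as f:
        written = f.read()

print(os.path.basename(target), read_mode_failed, repr(written))
```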

The modified code is shown below:

import os
import re
import jieba.posseg


def para2sentences_ccl(para):
    # Splits the text into sentences at sentence-final punctuation.
    marks = re.compile(r'。|？|……|？”|。”|！”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2], sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences


def file2sentences_CCl_pos(file_path):
    with open(file_path, 'r', encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None]
    # Flatten the per-line sentence lists so that each item is one sentence string
    seg_sentences = [sentence for sentence_list in cleaned_paras for sentence in sentence_list]
    flag_sentence = []
    for sentence in seg_sentences:
        flags = [flag for word, flag in jieba.posseg.lcut(sentence)]
        flag_sentence.extend(flags)
    # flag_sentence now holds the POS tag of every word, in order
    names = os.path.split(file_path)
    txt_name = 'out' + names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', encoding='utf8') as txt:
        txt.write('\n'.join(flag_sentence))  # one tag per line
    return flag_sentence


file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')

With these changes, the POS tags should be written to the output file.

What exactly is your question? I have also done Chinese word segmentation experiments, so I may be able to help.

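This answer draws on ChatGPT. From the code and the error message, the following problems can be identified:

1. cleaned_paras is a two-dimensional list: each sub-list holds several sentences, whereas jieba.posseg.lcut needs a single sentence as input. The list comprehension on the seg_sentences line therefore needs one more loop level, so that every sentence in every sub-list is passed to jieba.posseg.lcut on its own.

2. The output file should be opened in write mode 'w', not 'r'.

3. When writing to the file, write the flag attribute of each item rather than the item itself, because each item is a jieba.posseg.pair object, not a string.

4. The file path should be built with the os.path.join function, not with the str.join method.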

Here is the modified code:

import re
import os
import jieba.posseg

def para2sentences_ccl(para):
    # Split a paragraph into sentences at sentence-final punctuation
    marks = re.compile(r'。|？|……|？”|。”|！”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):
    # Read the contents of the file
    with open(file_path,'r',encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    
    # Split each paragraph of the file into sentences
    sentences = []
    for para in raw_lines:
        cleaned_paras = para2sentences_ccl(para)
        sentences.extend(cleaned_paras)

    # Tag each sentence and write the POS tags to the output file
    names = os.path.split(file_path)
    txt_name = 'out'+names[1]
    txt_path = os.path.join(names[0],txt_name)
    with open(txt_path,'w',encoding='utf8') as txt:
        for sentence in sentences:
            flag_sentence = jieba.posseg.lcut(sentence)
            for pair in flag_sentence:
                txt.write(pair.flag + '\n')  # pair.flag is already a str

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')

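Note that jieba.posseg.lcut returns a list of jieba.posseg.pair objects, so inside the loop the POS tag is available as the flag attribute of each pair. That attribute is already a string, so no conversion is needed before writing it to the file.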

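Judging from the traceback you posted, there is one error, not several. It comes from the seg_sentences line (line 20 in the traceback): in jieba.posseg.lcut(sentence), sentence must be a string and cannot be a list, which is why the call fails. Change that part of the code.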

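para2sentences_ccl returns a list, and you collect those return values with a list comprehension, so you end up with a list of lists. On each iteration of your loop, sentence is therefore a list, and a list has no decode method, which jieba calls on any input that is not already a str.
The fix is to add one more level of for loop.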
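
The extra loop can be sketched without jieba. In the snippet below, `split_sentences` is a simplified, hypothetical stand-in for `para2sentences_ccl`, and the sample lines are made up; the point is only the shape of the data before and after the second loop.

```python
import re


def split_sentences(para):
    """Simplified stand-in: return the sentences of one line as a list."""
    return re.findall(r"[^。？！]+[。？！]?", para.strip())


raw_lines = ["今天下雨。你去吗？\n", "我不去。\n"]

# One comprehension level: every element is itself a list of sentences.
nested = [split_sentences(para) for para in raw_lines]

# Second loop level: every element is now a single sentence string,
# which is the kind of argument jieba.posseg.lcut expects.
flat = [sentence for sentences in nested for sentence in sentences]

print(nested)
print(flat)
```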

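This answer draws on ChatGPT:
According to the traceback, the error is raised in the `strdecode` function with the message `'list' object has no attribute 'decode'`: the `sentence` argument is a list, and lists have no `decode` method. The cause is that each element of `cleaned_paras`, which is a list of sentences, is passed to `jieba.posseg.lcut`.

`jieba.posseg.lcut` segments a sentence and tags each word with its part of speech, so it has to be given a single string; it cannot process a whole list of sentences. The argument passed to `jieba.posseg.lcut` therefore needs to be a single string rather than a list.

There are two fixes to consider:

1. Keep `para2sentences_ccl` as it is, and in `file2sentences_CCl_pos` join each inner list of `cleaned_paras` back into one string with `''.join(...)` before passing it to `jieba.posseg.lcut`.

2. Read the whole file into memory, concatenate all lines into one string with `"".join(raw_lines)`, and pass that string to `jieba.posseg.lcut`.

The code for the first option is given below: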


import re
import os
import jieba.posseg

def para2sentences_ccl(para):
    # Splits the text into sentences at sentence-final punctuation.
    marks = re.compile(r'。|？|……|？”|。”|！”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

def file2sentences_CCl_pos(file_path):
    # Tag the text with jieba and create a new file that contains only the POS tags
    with open(file_path,'r', encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines() # read the material in line by line
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None] # split each line at sentence-final punctuation
    seg_sentences = [jieba.posseg.lcut(''.join(sentences)) for sentences in cleaned_paras] # re-join each line's sentences into one string, then tag it
    flag_sentence = [pair.flag for pairs in seg_sentences for pair in pairs] # keep only the flag (POS tag) of every pair
    names = os.path.split(file_path) # split the input file path into directory and file name
    txt_name = 'out' + names[1] # prefix the file name with "out"
    txt_path = os.path.join(names[0],txt_name) # combine the directory and txt_name into the output path
    with open(txt_path,'w',encoding='utf8') as txt: # open the output file for writing
        for flag in flag_sentence: # loop over the flags in flag_sentence
            txt.write(flag + '\n') # put each flag on its own line
    return flag_sentence

file2sentences_CCl_pos(r'D:\Users\DELL\Desktop\从普通女孩到银行家.txt')

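Here, the `''.join(...)` call inside `jieba.posseg.lcut(...)` joins each list of sentences into one string, using the empty string as the separator, and the result is then tagged.

This answer draws on ChatGPT:
The error message shows that the code tries to call decode on a list object, which lists do not support.

Specifically, the failing statement is: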


seg_sentences = [jieba.posseg.lcut(sentence) for sentence in cleaned_paras]

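Here cleaned_paras is a two-dimensional list: each element is the list of sentence strings that para2sentences_ccl returned for one line of the file, as shown below:


[[sentence1, sentence2, ...], [sentence1, sentence2, ...], ...]

jieba.posseg.lcut can only tag a string; it cannot operate on a list. We therefore need to loop over the sentences inside each inner list and tag them one at a time. The modified code is as follows: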


import os
import jieba.posseg

# para2sentences_ccl is the unchanged helper from the question
def file2sentences_CCl_pos(file_path):
    with open(file_path, 'r', encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    cleaned_paras = [para2sentences_ccl(para) for para in raw_lines if para is not None]
    seg_sentences = []
    for sentences in cleaned_paras:  # one list of sentences per line of the file
        for sentence in sentences:  # each sentence is a plain string
            seg_sentences.append(jieba.posseg.lcut(sentence))
    flag_sentence = [token.flag for sentence in seg_sentences for token in sentence]
    names = os.path.split(file_path)
    txt_name = 'out' + names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', encoding='utf8') as txt:
        for flag in flag_sentence:
            txt.write(flag + '\n')  # one tag per line
    return flag_sentence

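In the modified code, two nested loops visit every sentence of every line, tag each sentence string, and collect the resulting POS tags in flag_sentence, which is finally written to the txt file.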

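The following is partly based on the ChatGPT model:
First, the message AttributeError: 'list' object has no attribute 'decode' means that strdecode() was called with a list rather than a string. The next step is to find where in the code that list comes from.

In file2sentences_CCl_pos(), para2sentences_ccl() splits one line of text into several sentences and returns them as a list, so each element of cleaned_paras is a list of strings rather than a single string. para2sentences_ccl() itself can stay in the following form: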

def para2sentences_ccl(para):
    marks = re.compile(r'。|？|……|？”|。”|！”')
    if marks.search(para):
        sentences_with_marks = marks.split(para)
        sentences = [''.join(sentence).strip() for sentence in zip(sentences_with_marks[0::2],sentences_with_marks[1::2])]
    else:
        sentences = [para]
    new_sentences = [s for s in sentences if s is not None]
    return new_sentences

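In addition, the items returned by jieba.posseg.lcut() are jieba.posseg.pair objects rather than strings, so the flag attribute of each pair has to be taken before writing, and every sentence has to be passed to lcut() on its own. file2sentences_CCl_pos() can be modified as follows: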

def file2sentences_CCl_pos(file_path):
    with open(file_path,'r',encoding='gb18030', errors='ignore') as fo:
        raw_lines = fo.readlines()
    # Flatten the per-line sentence lists so that lcut() receives one sentence string at a time
    sentences = [s for para in raw_lines if para is not None for s in para2sentences_ccl(para)]
    seg_sentences = [[pair.flag for pair in jieba.posseg.lcut(sentence)] for sentence in sentences]
    names = os.path.split(file_path)
    txt_name = 'out'+names[1]
    txt_path = os.path.join(names[0], txt_name)
    with open(txt_path, 'w', encoding='utf8') as txt:
        for flag_sentence in seg_sentences:
            txt.write('\n'.join(flag_sentence) + '\n')
    return seg_sentences

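This function tags each sentence, keeps only the POS tag of every word, and writes the tags to the file one per line. Note that the output file has to be opened with mode 'w', which means write.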
