Tokenization mismatch when rebuilding a meta-learning dataset with spaCy

Each sample in the Amazon dataset from the paper "Few-Shot Text Classification with Distributional Signatures" consists of the fields text, raw, and label, where text is the tokenization of raw, stored as a list. Here is the text field of one sample:

list1 = ['i', 'was', 'pleasantly', 'surprised', 'with', 'this', '"', 'out', 'of', 'the', 'box', '"', 'series', '.', ' ', 'good', 'writing', ',', 'good', 'acting', ',', 'laugh', 'out', 'loud', 'situations', '.', ' ', 'devito', 'showing', 'up', 'in', 'the', 'second', 'season', 'gave', 'it', 'a', 'little', 'boost', 'as', 'he', "'s", 'always', 'dependable', 'for', 'turning', 'the', 'mundane', 'into', 'the', 'hilarious', '.', 'it', "'s", 'basically', 'about', '3', 'jackass', 'friends', 'in', 'philly', 'who', 'own', 'a', 'bar', 'and', 'get', 'themselves', 'into', 'offbeat', 'situations', '.', ' ', 'what', 'i', 'liked', 'best', 'is', 'that', 'it', 'is', 'not', 'the', 'clice', 'venue', 'for', 'the', 'young', 'and', 'the', 'beautiful', '.', ' ', 'it', 'often', 'hi', '-', 'lightes', 'the', 'old', 'and', 'the', 'ugly', 'and', 'in', 'doing', 'so', 'cultivates', 'a', 'good', 'portion', 'of', 'the', 'laughs', '.', 'worth', 'you', 'time', 'and', 'money', '....', 'bg']
The paper does not say which tokenizer was used. Here is what I get when I tokenize raw with spaCy's en_core_web_sm:
list2 = ['i', 'was', 'pleasantly', 'surprised', 'with', 'this', '"', 'out', 'of', 'the', 'box', '"', 'series', '.', ' ', 'good', 'writing', ',', 'good', 'acting', ',', 'laugh', 'out', 'loud', 'situations', '.', ' ', 'devito', 'showing', 'up', 'in', 'the', 'second', 'season', 'gave', 'it', 'a', 'little', 'boost', 'as', 'he', "'s", 'always', 'dependable', 'for', 'turning', 'the', 'mundane', 'into', 'the', 'hilarious.it', "'s", 'basically', 'about', '3', 'jackass', 'friends', 'in', 'philly', 'who', 'own', 'a', 'bar', 'and', 'get', 'themselves', 'into', 'offbeat', 'situations', '.', ' ', 'what', 'i', 'liked', 'best', 'is', 'that', 'it', 'is', 'not', 'the', 'clice', 'venue', 'for', 'the', 'young', 'and', 'the', 'beautiful', '.', ' ', 'it', 'often', 'hi', '-', 'lightes', 'the', 'old', 'and', 'the', 'ugly', 'and', 'in', 'doing', 'so', 'cultivates', 'a', 'good', 'portion', 'of', 'the', 'laughs.worth', 'you', 'time', 'and', 'money', '....', 'bg']

Every mismatched token is a word with an embedded '.', such as hilarious.it or c.g.i.
I want to split these tokens manually in the spaCy output, but a naive split also affects tokens that consist only of '.' (like '....'), and I haven't found a clean way to separate the two cases.
Alternatively, is there a more suitable tokenization method that reproduces the text field directly?
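One possible workaround (a sketch of my own, not what the paper did; the helper name insert_missing_spaces is hypothetical) is to repair the raw string before tokenizing: insert a space after any '.' that runs directly into a letter, so spaCy sees a normal sentence boundary. Note this would also break up genuine dotted abbreviations such as c.g.i, so it may not match the reference tokenization in every case.

```python
import re

def insert_missing_spaces(text):
    # Add a space after a '.' that is immediately followed by a letter,
    # e.g. "hilarious.It" -> "hilarious. It". In a run of dots like
    # "money....bg" only the final dot precedes a letter, so the
    # ellipsis itself stays intact.
    return re.sub(r'\.(?=[A-Za-z])', '. ', text)
```

After this fix, spaCy tokenizes "hilarious. it" into separate tokens instead of producing the glued token "hilarious.it".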

It turned out the root cause is missing whitespace: where one sentence runs straight into the next with no space after the period, the two words get glued into a single token. The elif branch in the code below handles exactly this case and solved the problem.


import re
import spacy

nlp = spacy.load("en_core_web_sm")
str1 = "I was pleasantly surprised with this \"out of the box\" series.  Good writing, good acting, laugh out loud situations.  Devito showing up in the second season gave it a little boost as he's always dependable for turning the mundane into the hilarious.It's basically about 3 jackass friends in Philly who own a bar and get themselves into offbeat situations.  What I liked best is that it is not the clice venue for the young and the beautiful.  It often hi-lightes the old and the ugly and in doing so cultivates a good portion of the laughs.Worth you time and money....bg"

list1 = []
doc = nlp(str1.lower())
for token in doc:
    if token.text == '"':
        # keep double quotes unchanged
        list1.append('"')
    elif '.' in token.text and token.text.count('.') != len(token.text):
        # A token like "hilarious.it" means two sentences were glued
        # together without a space; split it into words and periods.
        # Tokens made entirely of '.' (e.g. "....") fail the count check
        # and fall through to the else branch untouched.
        for x in re.findall(r'\w+|\.', token.text):
            list1.append(x)
    else:
        list1.append(token.text)
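The splitting rule in the elif can also be pulled out into a standalone helper (the name split_glued is my own) to check its behavior on the problem tokens without loading a spaCy model:

```python
import re

def split_glued(token_text):
    # Same rule as the elif branch above: split tokens that mix '.'
    # with other characters; leave pure-dot tokens (e.g. "....") and
    # dot-free tokens alone.
    if '.' in token_text and token_text.count('.') != len(token_text):
        return re.findall(r'\w+|\.', token_text)
    return [token_text]
```

For example, split_glued('hilarious.it') gives ['hilarious', '.', 'it'], while split_glued('....') returns ['....'] unchanged.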