Programming exercise: 1-gram sequence, uni-gram set, and uni-gram vector

Implement the following in any language: We choose a 1-gram sequence to parse a keyword, and we name this representation the uni-gram set. For example, the keyword “secure” is transformed to {s1, e1, c1, u1, r1, e2}, where “e1” is the first “e” in “secure” and “e2” is the second “e”. The uni-gram set is represented by a 160-bit vector, which is named the uni-gram vector. The uni-gram vector consists of 26 ∗ 5 + 30 bits, where the 26 ∗ 5 bits represent 26 ∗ 5 letter uni-grams and the 30 bits represent commonly used symbols and digits. A given bit is set to 1 if it characterizes a corresponding uni-gram; otherwise it remains 0.

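Problem restated: each keyword is split into single characters (1-grams), and each character is labeled with its occurrence count within the keyword; the resulting collection is the uni-gram set. For “secure” this gives {s1, e1, c1, u1, r1, e2}. The set is then encoded as a 160-bit uni-gram vector: 26 ∗ 5 = 130 bits for letters (presumably one bit for each of the 26 letters at each of up to five occurrences) plus 30 bits for commonly used symbols and digits. A bit is 1 if the corresponding uni-gram is present in the set and 0 otherwise.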
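As a worked example of the definition, the sketch below builds the uni-gram set for "secure" and lists the bit positions that would be set. The function names are illustrative only, and the bit layout (one group of 26 bits per occurrence rank, a to z within each group) is an assumption inferred from the expected output listed further down, since the statement does not spell it out.

```python
from collections import Counter

def uni_gram_set(keyword):
    """Label each character with its occurrence rank: 'e' -> 'e1', then 'e2'."""
    seen = Counter()
    grams = []
    for ch in keyword:
        seen[ch] += 1
        grams.append(f"{ch}{seen[ch]}")
    return grams

def letter_bit(gram):
    """Assumed layout: bit = (rank - 1) * 26 + letter offset, with a = 0."""
    return (int(gram[1:]) - 1) * 26 + ord(gram[0]) - ord("a")

grams = uni_gram_set("secure")
print(grams)                                  # ['s1', 'e1', 'c1', 'u1', 'r1', 'e2']
print(sorted(letter_bit(g) for g in grams))   # [2, 4, 17, 18, 20, 30]
```

Under this layout the second "e" lands at bit 30, which is 26 positions after the first "e" at bit 4.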

Test file: keyword.txt
stategov
selfempnotinc
federalgov
localgov
priv

Expected output 1: uni-gram set.txt
s1,t1,a1,t2,e1,g1,o1,v1
s1,e1,l1,f1,e2,m1,p1,n1,o1,t1,i1,n2,c1
f1,e1,d1,e2,r1,a1,l1,g1,o1,v1
l1,o1,c1,a1,l2,g1,o2,v1
p1,r1,i1,v1

Expected output 2: uni-gram vector.txt
{1,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...}
{0,0,1,0,1,1,0,0,1,0,0,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...}
{0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
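
To check a vector against its uni-gram set, the set bits can be decoded back into labels. This is a minimal sketch that assumes the layout the vectors above appear to follow (bit = (rank - 1) * 26 + letter offset); `decode_letter_bits` is a hypothetical helper, not part of the exercise.

```python
def decode_letter_bits(vector):
    """Return the uni-grams encoded in the first 26 * 5 bits of a vector."""
    grams = []
    for pos, bit in enumerate(vector[:130]):
        if bit:
            rank, offset = divmod(pos, 26)
            grams.append(f"{chr(ord('a') + offset)}{rank + 1}")
    return grams

# Rebuild the vector for "stategov" by hand from its set bits
vector = [0] * 160
for pos in (0, 4, 6, 14, 18, 19, 21, 45):
    vector[pos] = 1

print(decode_letter_bits(vector))
# ['a1', 'e1', 'g1', 'o1', 's1', 't1', 'v1', 't2']
```

Bit 45 decodes to t2 (45 = 26 + 19), which matches the second "t" in "stategov".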

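I don't know how the 30 bits for common symbols and digits map to positions, so for now the code only handles keywords made up entirely of English letters.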

if __name__ == '__main__':
    # Read keyword.txt and build the uni-gram set for each line
    uni_gram_list = []
    with open("keyword.txt", "r", encoding="utf-8") as f:
        text_line_list = f.read().splitlines()
    for text in text_line_list:
        uni_gram_dict = {}
        uni_gram_item_list = []
        for c in text:
            if c.isalpha():
                # Count this occurrence of the letter (1 for the first, 2 for the second, ...)
                uni_gram_dict[c] = uni_gram_dict.get(c, 0) + 1
                uni_gram_item_list.append(c + str(uni_gram_dict[c]))
        uni_gram_list.append(uni_gram_item_list)
    # Write uni-gram set.txt, one comma-separated set per line
    with open('uni-gram set.txt', 'w', encoding='utf-8') as f:
        for line in uni_gram_list:
            f.write(','.join(line) + '\n')
    # Write uni-gram vector.txt
    uni_gram_vector_list = [[0] * 160 for _ in uni_gram_list]
    for index, value in enumerate(uni_gram_list):
        for item in value:
            # Use ASCII codes: letter offset (a=0 ... z=25) plus 26 per earlier occurrence
            letter = ord(item[0]) - ord('a')
            occurrence = int(item[1:])
            # Skip characters outside a-z and occurrences beyond the 5 that fit in 26*5 bits
            if 0 <= letter < 26 and occurrence <= 5:
                uni_gram_vector_list[index][letter + (occurrence - 1) * 26] = 1
    with open('uni-gram vector.txt', 'w', encoding='utf-8') as f:
        for line in uni_gram_vector_list:
            # Braces match the expected output format shown above
            f.write('{' + ','.join(map(str, line)) + '}\n')
    print("success")
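
The statement does not say which symbols and digits the last 30 bits cover or in what order, so any mapping is a convention that the encoder and whoever reads the vector have to agree on. One possible sketch, purely as an assumption: digits 0-9 take bits 130-139 and a hand-picked list of 20 symbols takes bits 140-159, with a bit set if the character occurs at all. `EXTRA_CHARS`, `LETTER_BITS`, and `uni_gram_vector` are hypothetical names, and lower-casing the keyword is also an assumption.

```python
import string

# Assumed ordering of the 30 extra characters: 10 digits, then 20 symbols.
# The symbol list is an arbitrary choice, not taken from the problem statement.
EXTRA_CHARS = string.digits + "-_.@#$%&*+=!?/:;,'()"
LETTER_BITS = 26 * 5

def uni_gram_vector(keyword):
    """Encode a keyword as a 160-bit list under the assumed layout."""
    vector = [0] * (LETTER_BITS + len(EXTRA_CHARS))
    counts = {}
    for ch in keyword.lower():  # assumption: keywords are case-insensitive
        if "a" <= ch <= "z":
            counts[ch] = counts.get(ch, 0) + 1
            if counts[ch] <= 5:  # only five occurrences per letter fit
                vector[(counts[ch] - 1) * 26 + ord(ch) - ord("a")] = 1
        elif ch in EXTRA_CHARS:
            vector[LETTER_BITS + EXTRA_CHARS.index(ch)] = 1
    return vector

vec = uni_gram_vector("local-gov2")
print([i for i, bit in enumerate(vec) if bit])
# [0, 2, 6, 11, 14, 21, 37, 40, 132, 140]
```

For letter-only keywords this produces the same bits as the script above; characters outside the assumed list are silently ignored.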