Implementation task: a Bloom filter that supports locality-sensitive hashing


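Python implementation: the files sensitive keyword unigram vector2000.csv and nonsensitive keyword unigram vector2000.csv each hold information on 2000 files. Each row has four cells, and each cell holds the vector of one of the four words contained in a file. The task is to process each file's vectors with the k hash functions of the given p-stable locality-sensitive hashing (LSH) scheme and generate one Bloom filter per file (by the nature of a Bloom filter, ideally each vector sets k bit positions to 1).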

For p-stable LSH, refer to the p-stable-lsh-python-main project; for the Bloom filter and its hash functions, refer to the BloomFilter-master project.

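The idea is to plug the k hash functions h_{a,b}(v) from p-stable-lsh-python-main into BloomFilter-master; the hash functions that ship with BloomFilter-master can all be dropped. Then build one Bloom filter per row, using the insert function from BloomFilter-master to insert the 4 vectors in each row of the CSV file. One file, i.e. one row, corresponds to one Bloom filter named "<row number>.bin", so the first row becomes 1.bin. Take k=3 as an example, i.e. three hash functions h_{a1,b1}(v), h_{a2,b2}(v) and h_{a3,b3}(v). The hash function details are in p-stable-lsh-python-main; I am not sure yet whether it can generate multiple functions, so that needs testing.

I am not sure how hard this is, and the bounty can be raised. DM me if interested.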
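On the question of multiple hash functions: in the p-stable family each function is defined by its own random pair (a, b), so drawing k independent pairs should give k functions. Below is a minimal sketch of that idea, not the code of either project. The class names, the bucket width r, the filter size m and the seed are all assumptions made for illustration; only the method names insert and is_contain are taken from the post, and their real signatures in BloomFilter-master may differ.

```python
import math
import random


class PStableHash:
    """One hash h_{a,b}(v) = floor((a . v + b) / r), Gaussian (2-stable) case."""

    def __init__(self, dim, r, rng):
        # a: i.i.d. standard normal entries (the 2-stable distribution)
        self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # b: random offset drawn uniformly between 0 and r
        self.b = rng.uniform(0.0, r)
        self.r = r

    def __call__(self, v):
        dot = sum(ai * vi for ai, vi in zip(self.a, v))
        return math.floor((dot + self.b) / self.r)


class LSHBloomFilter:
    """Bloom filter whose k hash functions are p-stable LSH functions (hypothetical class)."""

    def __init__(self, dim, m=1024, k=3, r=4.0, seed=0):
        rng = random.Random(seed)  # fixed seed so the same functions can be rebuilt later
        self.m = m
        self.bits = bytearray(m)  # one byte per bit, kept simple for readability
        self.hashes = [PStableHash(dim, r, rng) for _ in range(k)]

    def _positions(self, v):
        # Python's % is non-negative for a positive modulus, even if the hash is negative
        return [h(v) % self.m for h in self.hashes]

    def insert(self, v):
        for pos in self._positions(v):
            self.bits[pos] = 1

    def is_contain(self, v):
        return all(self.bits[pos] for pos in self._positions(v))


if __name__ == "__main__":
    vec = [1, 0, 1, 0, 1, 1, 0, 0]
    bf = LSHBloomFilter(dim=len(vec), k=3)
    bf.insert(vec)
    print(bf.is_contain(vec))  # an inserted vector is always reported as present
    print(sum(bf.bits))        # at most k bits are set by one vector
```

Because the hashes are locality-sensitive, a vector close to an inserted one may also be reported as contained; that is the intended behaviour of this kind of filter rather than a bug. The same seed (or the stored a and b values) is needed when the filter is queried later, otherwise the bit positions will not match.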

Test files: there are two. Taking sensitive keyword unigram vector2000.csv as an example,
the first row is:
1,0,1,0,1,0,1,0,0,0,1,1,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,1,0,0,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,1,0,1,1,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,1,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Expected output 1: files 1.bin through 2000.bin, which must pass a test using the project's is_contain function.
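One possible way to produce the numbered .bin files is sketched below. It assumes each filter is available as a plain list of 0/1 values and packs 8 bits per byte, most significant bit first. The on-disk layout that BloomFilter-master actually expects is not stated in the post, so the helper names and the byte layout here are hypothetical and would need to be adapted to that project.

```python
import os
import tempfile


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte, MSB first."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bits(data, m):
    """Inverse of pack_bits: recover the first m bits."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(m)]


def save_filters(filters, out_dir):
    """Write the filter for row n (1-based) to '<n>.bin' inside out_dir."""
    paths = []
    for idx, bits in enumerate(filters, start=1):
        path = os.path.join(out_dir, f"{idx}.bin")
        with open(path, "wb") as f:
            f.write(pack_bits(bits))
        paths.append(path)
    return paths


def load_filter(path, m):
    """Read a .bin file back into a list of m bits."""
    with open(path, "rb") as f:
        return unpack_bits(f.read(), m)


if __name__ == "__main__":
    demo = [[0, 1, 0, 0, 1, 0, 0, 0, 1, 1], [1] * 10]  # two toy 10-bit filters
    with tempfile.TemporaryDirectory() as d:
        paths = save_filters(demo, d)
        print([os.path.basename(p) for p in paths])
        print(load_filter(paths[0], 10) == demo[0])
```

Reloading a file and checking that the original vectors still pass the membership test is a quick way to confirm the round trip before generating all 2000 files.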

I can help with the part about reading the CSV file:

from csv import reader
import numpy as np

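if __name__ == '__main__':
    # Adjust this path to wherever you saved the file
    with open('nonsensitive keyword unigram vector2000.csv', 'r', encoding='utf-8') as f:
        # Read row by row into a list
        data = list(reader(f))
    # After all rows are read, convert to a numpy array
    data = np.array(data)
    # First row
    print(data[0])
    # First row, first cell (one vector, stored as a comma-separated string)
    print(data[0][0])
    # The same cell with the commas removed
    print(data[0][0].replace(",", ""))
    # Length of that cell after removing commas, i.e. the vector dimension
    print(len(data[0][0].replace(",", "")))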
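Before hashing, each cell still has to be turned from a string into a numeric vector. Here is a small sketch of that step, assuming each cell is a quoted, comma-separated string of 0/1 values as in the sample row; parse_cell and read_rows are hypothetical helper names.

```python
import csv
import io


def parse_cell(cell):
    """Turn a cell such as '1,0,1,0' into the list [1, 0, 1, 0]."""
    return [int(x) for x in cell.split(",") if x.strip() != ""]


def read_rows(fileobj):
    """Yield one list of vectors per CSV row (one row corresponds to one file)."""
    for row in csv.reader(fileobj):
        yield [parse_cell(cell) for cell in row if cell.strip()]


if __name__ == "__main__":
    # Stand-in for the real file: one row with four quoted cells
    sample = '"1,0,1","0,0,1","1,1,0","0,1,0"\n'
    for vectors in read_rows(io.StringIO(sample)):
        print(vectors)
```

With the real data, the StringIO object would be replaced by the opened CSV file, and each row's four vectors would then be inserted into that row's Bloom filter.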

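This link is for reference only, to help your understanding while you write the program or track down and fix errors: https://xdrush.github.io/2017/08/09/%E5%B1%80%E9%83%A8%E6%95%8F%E6%84%9F%E5%93%88%E5%B8%8C/
[Locality-Sensitive Hashing: Principles and Applications] [The LSH (Locality-Sensitive Hashing) Algorithm]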

Hope this helps:
https://b23.tv/TCLaPg1