组合布隆过滤器

我使用布隆过滤器来检查集合中的重复数据。但是，需要将两组数据的结果组合成一个过滤器，以检查两组数据之间是否存在重复。我设计的伪Python的一个函数来执行此任务：组合布隆过滤器

def combine(a : bloom_filter, b : bloom_filter): 
    assert a.length == b.length 
    assert a.hashes == b.hashes 

    c = new bloom_filter(length = a.length, hashes = b.hashes) 
    c.attempts = a.attempts + b.attempts 
    c.bits = a.bits | b.bits 

    # Determining the amount of items 
    a_and_b = count(a & b) 
    a_not_b = count(a & !b) 
    not_a_b = count(!a & b) 
    neither = count(!a & !b) 
    c.item_count = a_not_b/a.length * a.item_count 
       + not_a_b/b.length * b.item_count 
       + a_and_b/c.length * min(a.item_count, b.item_count) 

    return c

这是否听起来甚至正确？关于是否甚至有可能做我想做的事情，我有很大的内部争议，因为关于源数据的大部分信息都丢失了（这是布隆过滤器的要点）。

来源

2011-05-23 Travis Gockel

可以推导出公式用于估计物品的量的布隆过滤器：

c = log(z/N)/((h * log(1 - 1/N)) 

N: Number of bits in the bit vector 
h: Number of hashes 
z: Number of zero bits in the bit vector

这提供在布隆过滤器的项目数的相当准确的估计。你可以用简单的减法来估计贡献。

来源

2011-06-05 18:48:55

，便有可能.....那种..

可以说，设置一个包含苹果和桔子

可以说集B含有豌豆和胡萝卜

构建一个简单的16位布隆过滤器作为一个例子和CRC32作为哈希

crc32(apples) = 0x70CCB02F 

crc32(oranges) = 0x45CDF3B4 

crc32(peas) = 0xB18D0C2B 

crc32(carrots) = 0x676A9E28

开始瓦特/空布隆过滤器（BF）（比方说16个比特）为两组（A，B）

BFA = BFB = 0000 0000 0000 0000

然后，把hash分成一些比特长度，我们将在这里使用4这里我们可以给BF添加苹果。例如

Get Apples BF Index list by splitting up the hash: 

0x70CCB02F = 0111 0000 1100 1100 1011 0000 0010 1111 
      7  0 C C B  0 2  F  
---------------------------------------------------- 

Add Apples to BFA by setting BF bit indexes [ 7, 0, 12, 12, 11, 0, 2, 15] 

           (set the index bit of an empty BF to 1) 
Apples =  1001 1000 1000 0101 (<- see indexes 0,2,7,11,12,15 are set) 
BF =   0000 0000 0000 0000 (or operation adds that item to the BF) 
================================ 
Updated BFA = 1001 1000 1000 0101

加入橙子至bf同样的方式：

0x45CDF3B4 = 0100 0101 1100 1101 1111 0011 1011 0100 
       4 5 12 13 15 3 11 4 
---------------------------------------------------- 
Add oranges to BF by setting BF bit indexes [ 4,5,12,13,15,3,11,4] 

Oranges =  1011 1000 0011 1000 
BFA =   1001 1000 1000 0101 (or operation) 
================================ 
Updated BFA = 1011 1000 1011 1101

所以，现在苹果和桔子插入BF1 W的1011 1000 1011 1101

/终值执行相同的BFB

crc32(peas) = 0xB18D0C2B becomes => 
set [11,2,12,0,13,1,8] in BFB 
0011 1001 0000 0011 = BF(peas) 

crc32(carrots) = 0x676A9E28 becomes => 
set [8,2,14,9,10,6,7] in BFB 

0100 0111 1100 0100 = BF(carrots) 

so BFB = 
0011 1001 0000 0011 BF(peas) 
0100 0111 1100 0100 BF(carrots) 
=================== ('add' them to BFB via locial or op) 
0111 1111 1100 0111

您现在可以在循环中搜索B中的条目，也可以搜索B中的副词：

确实乙包含“桔子” =>

1011 1000 0011 1000 (Oranges BF representation) 
0111 1111 1100 0111 (BFB) 
=====================  (and operation) 
0011 1000 0000 0000

因为这个结果(0011 1000 0000 0000)不匹配橙子的原BF，你可以肯定的是，B不包含任何橘子

......（对于其余物品）

以下，B不包含任何A项，就像B不包含任何苹果一样。

我不认为这就是你问的，但看起来你可以计算机上的差异高炉，这是更重要的。好像你可以做一个异或运算，并且会给你同时包含差异的“单”数组：

0111 1111 1100 0111 (BFB) 
1011 1000 1011 1101 (BFA) 
======================== 
1100 0111 0111 1010 (BFA xor BFB) == (items in B not in A, and items in A not in B)

与此单BF的意思，你可以检测项目的非脑干的时间100％，只是不存在该项目的100％。

你会使用它的方式，如下（检查是否豆豆“从A丢失）：

1100 0111 0111 1010 (BFA xor BFB) 
0011 1001 0000 0011 (Peas) 
============================== (And operation) 
0000 0001 0000 0010 (non-zero)

因为(BFA xor BFB) && (Peas) != 0你知道一集不包含‘豌豆’......

再次，你会逐项测试，也许你可以做聚合，但可能不是一个好主意......

希望这有助于！

来源

2012-11-15 21:49:55

组合布隆过滤器

回答

相关问题