it.unimi.dsi.util
Class BloomFilter

java.lang.Object
  extended by it.unimi.dsi.util.BloomFilter
All Implemented Interfaces:
Serializable

public class BloomFilter
extends Object
implements Serializable

A Bloom filter.

Instances of this class represent a set of character sequences or primitive-type arrays (with false positives) using a Bloom filter. Because of the way Bloom filters work, you cannot remove elements.

The intended usage is that character sequences and arrays should not be mixed (albeit in principle it could work). A Bloom filter is rather agnostic with respect to the data it contains—all it needs is a sequence of hash functions that can be applied to the data.

Bloom filters have an expected error rate, depending on the number of hash functions used, on the filter size and on the number of elements in the filter. This implementation uses a variable optimal number of hash functions, depending on the expected number of elements. More precisely, a Bloom filter for n character sequences with d hash functions will use ln 2 dn ≈ 1.44 dn bits; false positives will happen with probability 2-d. The maximum number of bits supported is MAX_BITS.

Hash functions are generated at creation time using a mix of universal hashing and shift-add-xor hashing (for the latter, see “Performance in practice of string hashing functions”, by M.V. Ramakrishna and Justin Zobel, Proc. of the Fifth International Conference on Database Systems for Advanced Applications, 1997, pages 215−223).

Each hash function uses NUMBER_OF_WEIGHTS random integers, which are cyclically multiplied by the character codes in a character sequence. The resulting integers are then summed with a shifted hash and XOR-ed together.

This class exports access methods that are similar to those of Set, but it does not implement that interface, as too many non-optional methods would be unimplementable (e.g., iterators).

A main method makes it easy to create serialised Bloom filters starting from a list of terms.

Author:
Sebastiano Vigna
See Also:
Serialized Form

Field Summary
 int d
          The number of hash functions used by this filter.
 long m
          The number of bits in this filter.
static long MAX_BITS
          The maximum number of bits in a filter (limited by array size and bits in a long).
static int NUMBER_OF_WEIGHTS
          The number of weights used to create hash functions.
 
Constructor Summary
BloomFilter(int n)
          Creates a new high-precision Bloom filter a given expected number of elements.
BloomFilter(int n, int d)
          Creates a new Bloom filter with given number of hash functions and expected number of elements.
 
Method Summary
 boolean add(byte[] a)
          Adds a byte array to the filter.
 boolean add(char[] a)
          Adds a character array to the filter.
 boolean add(CharSequence s)
          Adds a character sequence to the filter.
 boolean add(double[] a)
          Adds a double array to the filter.
 boolean add(float[] a)
          Adds a float array to the filter.
 boolean add(int[] a)
          Adds an int array to the filter.
 boolean add(long[] a)
          Adds a long array to the filter.
 boolean add(short[] a)
          Adds a short array to the filter.
 void clear()
          Clears this filter.
 boolean contains(byte[] a)
          Checks whether the given byte array is in this filter.
 boolean contains(char[] a)
          Checks whether the given character array is in this filter.
 boolean contains(CharSequence s)
          Checks whether the given character sequence is in this filter.
 boolean contains(double[] a)
          Checks whether the given double array is in this filter.
 boolean contains(float[] a)
          Checks whether the given float array is in this filter.
 boolean contains(int[] a)
          Checks whether the given int array is in this filter.
 boolean contains(long[] a)
          Checks whether the given long array is in this filter.
 boolean contains(short[] a)
          Checks whether the given short array is in this filter.
static void main(String[] arg)
           
 long size()
          Returns the size of this filter.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_BITS

public static final long MAX_BITS
The maximum number of bits in a filter (limited by array size and bits in a long).

See Also:
Constant Field Values

NUMBER_OF_WEIGHTS

public static final int NUMBER_OF_WEIGHTS
The number of weights used to create hash functions.

See Also:
Constant Field Values

m

public final long m
The number of bits in this filter.


d

public final int d
The number of hash functions used by this filter.

Constructor Detail

BloomFilter

public BloomFilter(int n)
Creates a new high-precision Bloom filter a given expected number of elements.

This constructor uses a number of hash functions that is logarithmic in the number of expected elements. This usually results in no false positives at all.

Parameters:
n - the expected number of elements.

BloomFilter

public BloomFilter(int n,
                   int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements.

Parameters:
n - the expected number of elements.
d - the number of hash functions; under obvious uniformity and indipendence assumptions, if the filter has not more than n elements, false positives will happen with probability 2-d.
Method Detail

contains

public boolean contains(CharSequence s)
Checks whether the given character sequence is in this filter.

Note that this method may return true on a character sequence that has not been added to the filter. This will happen with probability 2-d, where d is the number of hash functions specified at creation time, if the number of the elements in the filter is less than n, the number of expected elements specified at creation time.

Parameters:
s - a character sequence.
Returns:
true if s (or some element with the same hash sequence as s) is in the filter.

contains

public boolean contains(byte[] a)
Checks whether the given byte array is in this filter.

Parameters:
a - a byte array.
Returns:
true if a (or some element with the same hash sequence as a) is in the filter.
See Also:
contains(CharSequence)

contains

public boolean contains(short[] a)
Checks whether the given short array is in this filter.

Parameters:
a - a short array.
Returns:
true if a (or some element with the same hash sequence as a) is in the filter.
See Also:
contains(CharSequence)

contains

public boolean contains(char[] a)
Checks whether the given character array is in this filter.

Parameters:
a - a character array.
Returns:
true if a (or some element with the same hash sequence as a) is in the filter.
See Also:
contains(CharSequence)

contains

public boolean contains(int[] a)
Checks whether the given int array is in this filter.

Parameters:
a - an int array.
Returns:
true if a (or some element with the same hash sequence as a) is in the filter.
See Also:
contains(CharSequence)

contains

public boolean contains(long[] a)
Checks whether the given long array is in this filter.

Parameters:
a - a long array.
Returns:
true if a (or some element with the same hash sequence as a) is in the filter.
See Also:
contains(CharSequence)

contains

public boolean contains(float[] a)
Checks whether the given float array is in this filter.

Parameters:
a - a float array.
Returns:
true if a (or some element with the same hash sequence as a) is in the filter.
See Also:
contains(CharSequence)

contains

public boolean contains(double[] a)
Checks whether the given double array is in this filter.

Parameters:
a - a double array.
Returns:
true if a (or some element with the same hash sequence as a) is in the filter.
See Also:
contains(CharSequence)

add

public boolean add(CharSequence s)
Adds a character sequence to the filter.

Parameters:
s - a character sequence.
Returns:
true if this filter was modified (i.e., neither s nor any other element with the same hash sequence as s was already in this filter).

add

public boolean add(byte[] a)
Adds a byte array to the filter.

Parameters:
a - a byte array.
Returns:
true if this filter was modified (i.e., neither a nor any other element with the same hash sequence as a was already in this filter).

add

public boolean add(short[] a)
Adds a short array to the filter.

Parameters:
a - a short array.
Returns:
true if this filter was modified (i.e., neither a nor any other element with the same hash sequence as a was already in this filter).

add

public boolean add(char[] a)
Adds a character array to the filter.

Parameters:
a - a character array.
Returns:
true if this filter was modified (i.e., neither a nor any other element with the same hash sequence as a was already in this filter).

add

public boolean add(int[] a)
Adds an int array to the filter.

Parameters:
a - an int array.
Returns:
true if this filter was modified (i.e., neither a nor any other element with the same hash sequence as a was already in this filter).

add

public boolean add(long[] a)
Adds a long array to the filter.

Parameters:
a - a long array.
Returns:
true if this filter was modified (i.e., neither a nor any other element with the same hash sequence as a was already in this filter).

add

public boolean add(float[] a)
Adds a float array to the filter.

Parameters:
a - a float array.
Returns:
true if this filter was modified (i.e., neither a nor any other element with the same hash sequence as a was already in this filter).

add

public boolean add(double[] a)
Adds a double array to the filter.

Parameters:
a - a double array.
Returns:
true if this filter was modified (i.e., neither a nor any other element with the same hash sequence as a was already in this filter).

clear

public void clear()
Clears this filter.


size

public long size()
Returns the size of this filter.

Note that the size of a Bloom filter is only a lower bound for the number of distinct elements that have been added to the filter. False positives might make the number returned by this method smaller than it should be.

Returns:
the size of this filter.

main

public static void main(String[] arg)
                 throws IOException,
                        JSAPException,
                        NoSuchMethodException
Throws:
IOException
JSAPException
NoSuchMethodException