it.unimi.dsi.sux4j.mph
Class MinimalPerfectHashFunction<T>

java.lang.Object
  extended by it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<K>
      extended by it.unimi.dsi.sux4j.mph.AbstractHashFunction<T>
          extended by it.unimi.dsi.sux4j.mph.MinimalPerfectHashFunction<T>
All Implemented Interfaces:
Function<T,Long>, Object2LongFunction<T>, Serializable

public class MinimalPerfectHashFunction<T>
extends AbstractHashFunction<T>
implements Serializable

A minimal perfect hash function.

Given a list of elements without duplicates, the constructors of this class finds a minimal perfect hash function for the list. Subsequent calls to the getLong(Object) method will return a distinct number for each elements in the list. For elements out of the list, the resulting number is not specified. In some (rare) cases it might be possible to establish that an element was not in the original list, and in that case -1 will be returned. The class can then be saved by serialisation and reused later.

The theoretical memory requirements are 2γ=2.46 + o(n) bits per element, plus the bits for the random hashes (which are usually negligible). The o(n) part is due to an embedded ranking scheme that increases space occupancy by 0.625%, bringing the actual occupied space to around 2.65 bits per element.

As a commodity, this class provides a main method that reads from standard input a (possibly gzip'd) sequence of newline-separated strings, and writes a serialised minimal perfect hash function for the given list.

How it Works

The technique used is very similar (but developed independently) to that described by Botelho, Pagh and Ziviani in “Simple and Efficient Minimal Perfect Hashing Functions”, Proc. WADS 2007, number 4619 of Lecture Notes in Computer Science, pages 139−150, 2007. In turn, the mapping technique described therein was actually first proposed by Chazelle, Kilian, Rubinfeld and Tal in “The Bloomier Filter: an Efficient Data Structure for Static Support Lookup Tables”, Proc. SODA 2004, pages 30−39, 2004, as one of the steps to implement a mutable table.

The basic ingredient is the Majewski-Wormald-Havas-Czech 3-hypergraph technique. After generating a random 3-hypergraph with suitably sorted edges, we assign to each vertex a two-bit number in such a way that for each 3-hyperedge the sum of the values associated to its vertices modulo 3 gives the index of the hash function generating the hinge. The hinge is distinct for each 3-hyperedge, and as a result we obtain a perfect hash of the original set (one just has to compute the three hash functions, collect the three two-bit values, add them modulo 3 and take the corresponding hash function value).

To obtain a minimal perfect hash, we simply notice that we whenever we have to assign a value to a vertex, we can take care of using the number 3 instead of 0 if the vertex is actually the output value for some element. As a result, the final value of the minimal perfect hash function is the number of nonzero pairs of bits that precede the perfect hash value for the element. To compute this number, we use a simple table-free ranking scheme, recording the number of nonzero pairs each BITS_PER_BLOCK bits and modifying the standard broadword algorithm for computing the number of ones in a word into an algorithm that counts the number of nonzero pairs of bits in a word.

Since:
0.1
Author:
Sebastiano Vigna
See Also:
Serialized Form

Field Summary
protected  long[] array
          The bit array supporting values.
static int BITS_PER_BLOCK
          The number of bits per block in the rank structure.
protected  LongArrayBitVector bitVector
          The bit vector underlying values.
protected  int[] count
          The number of nonzero bit pairs up to a given block of BITS_PER_BLOCK bits.
protected  long globalSeed
          The seed used to generate the initial hash triple.
static int LOG2_CHUNK_SIZE
          The logarithm of the desired chunk size.
protected  int n
          The number of elements.
protected  int[] offset
          The start offset of each block.
protected  long[] seed
          The seed of the underlying 3-hypergraphs.
static long serialVersionUID
           
protected  TransformationStrategy<? super T> transform
          The transformation strategy.
protected  LongBigList values
          The final magick—the list of modulo-3 values that define the output of the minimal hash function.
 
Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defRetValue
 
Constructor Summary
  MinimalPerfectHashFunction(Iterable<? extends T> elements, TransformationStrategy<? super T> transform)
          Creates a new minimal perfect hash table for the given elements.
  MinimalPerfectHashFunction(Iterable<? extends T> elements, TransformationStrategy<? super T> transform, ChunkedHashStore<T> chunkedHashStore)
          Creates a new minimal perfect hash table for the given elements.
protected MinimalPerfectHashFunction(MinimalPerfectHashFunction<T> mph)
          Creates a new minimal perfect hash by copying a given one; non-transient fields are (shallow) copied.
 
Method Summary
static int countNonzeroPairs(long x)
          Counts the number of nonzero pairs of bits in a long.
 long getLong(Object key)
           
 long getLongByTriple(long[] triple)
           
static void main(String[] arg)
           
 long numBits()
          Returns the number of bits used by this structure.
 int rank(long x)
           
 int size()
           
 
Methods inherited from class it.unimi.dsi.sux4j.mph.AbstractHashFunction
containsKey
 
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
clear, defaultReturnValue, defaultReturnValue, get, put, put, remove, removeLong
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

public static final long serialVersionUID
See Also:
Constant Field Values

BITS_PER_BLOCK

public static final int BITS_PER_BLOCK
The number of bits per block in the rank structure.

See Also:
Constant Field Values

LOG2_CHUNK_SIZE

public static final int LOG2_CHUNK_SIZE
The logarithm of the desired chunk size.

See Also:
Constant Field Values

n

protected final int n
The number of elements.


globalSeed

protected final long globalSeed
The seed used to generate the initial hash triple.


seed

protected final long[] seed
The seed of the underlying 3-hypergraphs.


offset

protected final int[] offset
The start offset of each block.


values

protected final LongBigList values
The final magick—the list of modulo-3 values that define the output of the minimal hash function.


bitVector

protected final LongArrayBitVector bitVector
The bit vector underlying values.


array

protected transient long[] array
The bit array supporting values.


count

protected final int[] count
The number of nonzero bit pairs up to a given block of BITS_PER_BLOCK bits.


transform

protected final TransformationStrategy<? super T> transform
The transformation strategy.

Constructor Detail

MinimalPerfectHashFunction

public MinimalPerfectHashFunction(Iterable<? extends T> elements,
                                  TransformationStrategy<? super T> transform)
                           throws IOException
Creates a new minimal perfect hash table for the given elements.

Parameters:
elements - the elements to hash.
transform - a transformation strategy for the elements.
Throws:
IOException

MinimalPerfectHashFunction

public MinimalPerfectHashFunction(Iterable<? extends T> elements,
                                  TransformationStrategy<? super T> transform,
                                  ChunkedHashStore<T> chunkedHashStore)
                           throws IOException
Creates a new minimal perfect hash table for the given elements.

Parameters:
elements - the elements to hash.
transform - a transformation strategy for the elements.
Throws:
IOException

MinimalPerfectHashFunction

protected MinimalPerfectHashFunction(MinimalPerfectHashFunction<T> mph)
Creates a new minimal perfect hash by copying a given one; non-transient fields are (shallow) copied.

Parameters:
mph - the perfect hash to be copied.
Method Detail

numBits

public long numBits()
Returns the number of bits used by this structure.

Returns:
the number of bits used by this structure.

countNonzeroPairs

public static int countNonzeroPairs(long x)
Counts the number of nonzero pairs of bits in a long.

This method uses a very simple modification of the classical broadword algorithm to count the number of ones in a word. Usually, in the first step of the algorithm one computes the number of ones in each pair of bits. Instead, we compute one iff at least one of the bits in the pair is nonzero.

Parameters:
x - a long.
Returns:
the number of nonzero bit pairs in x.

rank

public int rank(long x)

getLong

public long getLong(Object key)
Specified by:
getLong in interface Object2LongFunction<T>

getLongByTriple

public long getLongByTriple(long[] triple)

size

public int size()
Specified by:
size in interface Function<T,Long>
Overrides:
size in class AbstractHashFunction<T>

main

public static void main(String[] arg)
                 throws NoSuchMethodException,
                        IOException,
                        JSAPException
Throws:
NoSuchMethodException
IOException
JSAPException