it.unimi.dsi.sux4j.io
Class ChunkedHashStore<T>

java.lang.Object
  extended by it.unimi.dsi.sux4j.io.ChunkedHashStore<T>
All Implemented Interfaces:
SafelyCloseable, Closeable, Serializable, Iterable<ChunkedHashStore.Chunk>

public class ChunkedHashStore<T>
extends Object
implements Serializable, SafelyCloseable, Iterable<ChunkedHashStore.Chunk>

A temporary store of hash triples virtually divided into chunks.

A chunked hash store accumulates objects of type T by turning them into bit vectors (using a provided TransformationStrategy) and then hashing such vectors into a triple of longs (i.e., overall we get a hash of 192 bits). Elements can be added one by one or in batches. Besides the hashes, we store the offset of each element added (the first element added has offset 0, the second one has offset 1, and so on). Elements must be distinct, or more precisely, they must be transformed into distinct bit vectors.

Once all elements have been added, they can be gathered into chunks whose tentative size can be set by calling log2Chunks(int). More precisely, if the latter method is called with argument k, then each chunk will be formed by grouping triples by the k most significant bits of their first hash.

To obtain triples, one calls iterator(), which returns chunks one at a time (in their natural order); triples within each chunk are returned by increasing hash. Actually, the iterator provided by a chunk returns a quadruple whose last element is the offset of the element that generated the triple.

It is possible (albeit unlikely) that different elements generate the same hash. This event is detected during chunk iteration (not while accumulating hashes), and it will throw a ChunkedHashStore.DuplicateException. At that point, the caller must handle the exception by resetting the store ant trying again from scratch. Note that after a few (say, three) exceptions you can assume that there are duplicate elements. If you need to force a check on the whole store you can call check(). If all your elements come from an Iterable, checkAndRetry(Iterable) will try three times to build a checked chunked hash store.

Note that every reset(long) changes the seed used by the store to generate triples. So, if this seed has to be stored this must happen after the last call to reset(long). To help tracking this fact, a call to seed() will lock the store; any further call to reset(long) will throw an IllegalStateException. In case the store needs to be reused, you can call clear(), that will bring back the store to after-creation state.

When you have finished using a chunked hash store, you should close() it. This class implements SafelyCloseable, and thus provides a safety-net finalizer.

Filtering

You can at any time set a predicate that will filter the triples returned by the store.

Implementation details

Internally, a chunked hash store has a notion of disk chunk: triples are stored on disk using a fixed number of bits. Once the user chooses a chunk size, the store exhibits the data on disk by grouping disk chunks or splitting them in a suitable way. This process is transparent to the user.

An instance of this class will save triples into DISK_CHUNKS disk chunks. Triples have to be loaded into memory only chunk by chunk, so to be sorted and tested for uniqueness. As long as DISK_CHUNKS is larger than eight, the store will need less than one bit per element of main memory. DISK_CHUNKS can be increased arbitrarily at compile time, but each store will open DISK_CHUNKS files at the same time. (For the same reason, it is strongly suggested that you close your stores as soon as you do not need them).

Intended usage

Chunked hash stores should be built by classes that need to manipulate elements in chunks of approximate given size without needing access to the elements themselves, but just to their triples, a typical example being MinimalPerfectHashFunction, which uses the triples to compute a 3-hyperedge. Once a chunked hash store is built, it can be passed on to further substructures, reducing greatly the computation time (as the original collection need not to be scanned again).

To compute the chunk corresponding to given element, use

 final long[] h = new long[ 3 ];
 Hashes.jenkins( transform.toBitVector( key ), seed, h );
 final int chunk = chunkShift == Long.SIZE ? 0 : (int)( h[ 0 ] >>> chunkShift );
 
where seed is the store seed, and chunkShift is the return value of log2Chunks(int) and should be stored by the caller. Note that you do not need the chunked hash store to compute these data, but just its seed and the chunk shift.

Since:
1.0.4
Author:
Sebastiano Vigna
See Also:
Serialized Form

Nested Class Summary
static class ChunkedHashStore.Chunk
          A chunk returned by a ChunkedHashStore.
static class ChunkedHashStore.DuplicateException
          Denotes that the chunked hash store contains a duplicate hash triple.
 
Field Summary
static int DISK_CHUNKS
          The number of physical disk chunks.
static int DISK_CHUNKS_SHIFT
          The shift for physical disk chunks.
static int LOG2_DISK_CHUNKS
          The logarithm of the number of physical disk chunks.
protected  int n
          The number of elements that pass the current filter.
protected  long seed
          The seed used to generate the hash triples.
static long serialVersionUID
           
 
Constructor Summary
ChunkedHashStore(TransformationStrategy<? super T> transform)
          Creates a chunked hash store with given transformation strategy.
ChunkedHashStore(TransformationStrategy<? super T> transform, ProgressLogger pl)
          Creates a chunked hash store with given transformation strategy and progress logger.
 
Method Summary
 void add(T o)
          Adds an element to this store.
 void addAll(Iterator<? extends T> iterator)
          Adds the elements returned by an iterator to this store.
 void check()
          Checks that this store has no duplicate triples, throwing an exception if this fails to happen.
 void checkAndRetry(Iterable<? extends T> iterable)
          Checks that this store has no duplicate triples, and try to rebuild if this fails to happen.
 void clear()
          Clears this store.
 void close()
          Closes this store, disposing all associated resources.
 void filter(org.apache.commons.collections.Predicate filter)
          Sets a filter for this store.
protected  void finalize()
           
 Iterator<ChunkedHashStore.Chunk> iterator()
          Returns an iterator over the chunks of this chunked hash store.
 int log2Chunks(int log2chunks)
          Sets the number of chunks.
 void reset(long seed)
          Resets this store using a new seed.
 long seed()
          Return the current seed of this chunked hash store.
 int size()
          Returns the size of this store.
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

serialVersionUID

public static final long serialVersionUID
See Also:
Constant Field Values

LOG2_DISK_CHUNKS

public static final int LOG2_DISK_CHUNKS
The logarithm of the number of physical disk chunks.

See Also:
Constant Field Values

DISK_CHUNKS

public static final int DISK_CHUNKS
The number of physical disk chunks.

See Also:
Constant Field Values

DISK_CHUNKS_SHIFT

public static final int DISK_CHUNKS_SHIFT
The shift for physical disk chunks.

See Also:
Constant Field Values

n

protected int n
The number of elements that pass the current filter.


seed

protected long seed
The seed used to generate the hash triples.

Constructor Detail

ChunkedHashStore

public ChunkedHashStore(TransformationStrategy<? super T> transform)
                 throws IOException
Creates a chunked hash store with given transformation strategy.

Parameters:
transform - a transformation strategy for the elements.
Throws:
IOException

ChunkedHashStore

public ChunkedHashStore(TransformationStrategy<? super T> transform,
                        ProgressLogger pl)
                 throws IOException
Creates a chunked hash store with given transformation strategy and progress logger.

Parameters:
transform - a transformation strategy for the elements.
pl - a progress logger, or null.
Throws:
IOException
Method Detail

seed

public long seed()
Return the current seed of this chunked hash store. After calling this method, no reset(long) will be allowed (unless the store is cleared).

Returns:
the current seed of this chunked hash store.

add

public void add(T o)
         throws IOException
Adds an element to this store.

Parameters:
o - the element to be added.
Throws:
IOException

addAll

public void addAll(Iterator<? extends T> iterator)
            throws IOException
Adds the elements returned by an iterator to this store.

Parameters:
iterator - an iterator returning elements.
Throws:
IOException

size

public int size()
         throws IOException
Returns the size of this store. Note that if you set up a filter, the first call to this method will require a scan to the whole store.

Returns:
the number of (possibly filtered) triples of this store.
Throws:
IOException

clear

public void clear()
Clears this store. After a call to this method, the store can be reused.


finalize

protected void finalize()
                 throws Throwable
Overrides:
finalize in class Object
Throws:
Throwable

close

public void close()
Closes this store, disposing all associated resources.

Specified by:
close in interface Closeable

reset

public void reset(long seed)
Resets this store using a new seed. All accumulated data are cleared, and a new seed is reinstated.

Parameters:
seed - the new seed.
Throws:
IllegalStateException - if this store was locked by a call to seed(), and never cleared thereafter.

check

public void check()
           throws ChunkedHashStore.DuplicateException
Checks that this store has no duplicate triples, throwing an exception if this fails to happen.

Throws:
ChunkedHashStore.DuplicateException - if this store contains duplicate triples.

checkAndRetry

public void checkAndRetry(Iterable<? extends T> iterable)
                   throws IOException
Checks that this store has no duplicate triples, and try to rebuild if this fails to happen.

Parameters:
iterable - the elements with which the store will be refilled if there are duplicate triples.
Throws:
IllegalArgumentException - if after a few trials the store still contains duplicate triples.
IOException

log2Chunks

public int log2Chunks(int log2chunks)
Sets the number of chunks.

Once the store is filled, you must call this method to set the number of chunks. The store will take care of merging or fragmenting disk chunks to get exactly the desired chunks.

Parameters:
log2chunks - the base-2 logarithm of the number of chunks.
Returns:
the shift to be applied to the first hash of a triple to get the chunk number (see the introduction).

filter

public void filter(org.apache.commons.collections.Predicate filter)
Sets a filter for this store.

Parameters:
filter - a predicate that will be used to filter triples.

iterator

public Iterator<ChunkedHashStore.Chunk> iterator()
Returns an iterator over the chunks of this chunked hash store.

Specified by:
iterator in interface Iterable<ChunkedHashStore.Chunk>
Returns:
an iterator over the chunks of this chunked hash store.