|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.dsi.sux4j.io.ChunkedHashStore<T>
public class ChunkedHashStore<T>
A temporary store of hash triples virtually divided into chunks.
A chunked hash store accumulates objects of type T
by turning them into bit vectors (using a provided TransformationStrategy
)
and then hashing such vectors into a triple of longs (i.e., overall we get a hash of 192 bits).
Elements can be added one by one
or in batches. Besides the hashes, we store the offset of each element added (the first element added has offset 0,
the second one has offset 1, and so on). Elements must be distinct, or more precisely, they must be transformed into distinct bit vectors.
Once all elements have been added, they can be gathered into chunks whose
tentative size can be set by calling log2Chunks(int)
. More precisely,
if the latter method is called with argument k, then each chunk
will be formed by grouping triples by the k most significant bits of their first hash.
To obtain triples, one calls iterator()
, which returns chunks one at a time (in their
natural order); triples within each chunk are returned by increasing hash. Actually, the iterator
provided by a chunk returns a quadruple whose last element is the offset of the element
that generated the triple.
It is possible (albeit unlikely) that different elements generate the same hash. This event is detected
during chunk iteration (not while accumulating hashes), and it will throw a ChunkedHashStore.DuplicateException
.
At that point, the caller must handle the exception by resetting the store ant trying again
from scratch. Note that after a few (say, three) exceptions you can assume that there are duplicate elements. If you
need to force a check on the whole store you can call check()
. If all your elements come from an Iterable
,
checkAndRetry(Iterable)
will try three times to build a checked chunked hash store.
Note that every reset(long)
changes the seed used by the store to generate triples. So, if this seed has to be
stored this must happen after the last call to reset(long)
. To help tracking this fact, a call to
seed()
will lock the store; any further call to reset(long)
will throw an IllegalStateException
.
In case the store needs to be reused, you can call clear()
, that will bring back the store to after-creation state.
When you have finished using a chunked hash store, you should close()
it. This class implements
SafelyCloseable
, and thus provides a safety-net finalizer.
You can at any time set a predicate that will filter the triples returned by the store.
Internally, a chunked hash store has a notion of disk chunk: triples are stored on disk using a fixed number of bits. Once the user chooses a chunk size, the store exhibits the data on disk by grouping disk chunks or splitting them in a suitable way. This process is transparent to the user.
An instance of this class will save triples into DISK_CHUNKS
disk chunks. Triples have to
be loaded into memory only chunk by chunk, so to be sorted and tested for uniqueness. As long as
DISK_CHUNKS
is larger than eight, the store will need less than one bit per element of main
memory. DISK_CHUNKS
can be increased arbitrarily at compile time, but each store
will open DISK_CHUNKS
files at the same time. (For the same reason, it is
strongly suggested that you close your stores as soon as you do not need them).
Chunked hash stores should be built by classes that need to manipulate elements in chunks of approximate given
size without needing access to the elements themselves, but just to their triples, a typical
example being MinimalPerfectHashFunction
, which uses the triples to compute a 3-hyperedge. Once a chunked hash
store is built, it can be passed on to further substructures, reducing greatly the computation time (as the original
collection need not to be scanned again).
To compute the chunk corresponding to given element, use
final long[] h = new long[ 3 ]; Hashes.jenkins( transform.toBitVector( key ), seed, h ); final int chunk = chunkShift == Long.SIZE ? 0 : (int)( h[ 0 ] >>> chunkShift );where
seed
is the store seed, and chunkShift
is the return value of log2Chunks(int)
and should be stored by the caller. Note
that you do not need the chunked hash store to compute these data, but just its seed and the chunk shift.
Nested Class Summary | |
---|---|
static class |
ChunkedHashStore.Chunk
A chunk returned by a ChunkedHashStore . |
static class |
ChunkedHashStore.DuplicateException
Denotes that the chunked hash store contains a duplicate hash triple. |
Field Summary | |
---|---|
static int |
DISK_CHUNKS
The number of physical disk chunks. |
static int |
DISK_CHUNKS_SHIFT
The shift for physical disk chunks. |
static int |
LOG2_DISK_CHUNKS
The logarithm of the number of physical disk chunks. |
protected int |
n
The number of elements that pass the current filter. |
protected long |
seed
The seed used to generate the hash triples. |
static long |
serialVersionUID
|
Constructor Summary | |
---|---|
ChunkedHashStore(TransformationStrategy<? super T> transform)
Creates a chunked hash store with given transformation strategy. |
|
ChunkedHashStore(TransformationStrategy<? super T> transform,
ProgressLogger pl)
Creates a chunked hash store with given transformation strategy and progress logger. |
Method Summary | |
---|---|
void |
add(T o)
Adds an element to this store. |
void |
addAll(Iterator<? extends T> iterator)
Adds the elements returned by an iterator to this store. |
void |
check()
Checks that this store has no duplicate triples, throwing an exception if this fails to happen. |
void |
checkAndRetry(Iterable<? extends T> iterable)
Checks that this store has no duplicate triples, and try to rebuild if this fails to happen. |
void |
clear()
Clears this store. |
void |
close()
Closes this store, disposing all associated resources. |
void |
filter(org.apache.commons.collections.Predicate filter)
Sets a filter for this store. |
protected void |
finalize()
|
Iterator<ChunkedHashStore.Chunk> |
iterator()
Returns an iterator over the chunks of this chunked hash store. |
int |
log2Chunks(int log2chunks)
Sets the number of chunks. |
void |
reset(long seed)
Resets this store using a new seed. |
long |
seed()
Return the current seed of this chunked hash store. |
int |
size()
Returns the size of this store. |
Methods inherited from class java.lang.Object |
---|
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final long serialVersionUID
public static final int LOG2_DISK_CHUNKS
public static final int DISK_CHUNKS
public static final int DISK_CHUNKS_SHIFT
protected int n
protected long seed
Constructor Detail |
---|
public ChunkedHashStore(TransformationStrategy<? super T> transform) throws IOException
transform
- a transformation strategy for the elements.
IOException
public ChunkedHashStore(TransformationStrategy<? super T> transform, ProgressLogger pl) throws IOException
transform
- a transformation strategy for the elements.pl
- a progress logger, or null
.
IOException
Method Detail |
---|
public long seed()
reset(long)
will be allowed (unless the store
is cleared).
public void add(T o) throws IOException
o
- the element to be added.
IOException
public void addAll(Iterator<? extends T> iterator) throws IOException
iterator
- an iterator returning elements.
IOException
public int size() throws IOException
IOException
public void clear()
protected void finalize() throws Throwable
finalize
in class Object
Throwable
public void close()
close
in interface Closeable
public void reset(long seed)
seed
- the new seed.
IllegalStateException
- if this store was locked by a call to seed()
, and never cleared thereafter.public void check() throws ChunkedHashStore.DuplicateException
ChunkedHashStore.DuplicateException
- if this store contains duplicate triples.public void checkAndRetry(Iterable<? extends T> iterable) throws IOException
iterable
- the elements with which the store will be refilled if there are duplicate triples.
IllegalArgumentException
- if after a few trials the store still contains duplicate triples.
IOException
public int log2Chunks(int log2chunks)
Once the store is filled, you must call this method to set the number of chunks. The store will take care of merging or fragmenting disk chunks to get exactly the desired chunks.
log2chunks
- the base-2 logarithm of the number of chunks.
public void filter(org.apache.commons.collections.Predicate filter)
filter
- a predicate that will be used to filter triples.public Iterator<ChunkedHashStore.Chunk> iterator()
iterator
in interface Iterable<ChunkedHashStore.Chunk>
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |