java.lang.Object
  it.unimi.dsi.mg4j.tool.Combine
public abstract class Combine
Combines several indices.
Indices may be combined in several different ways. This abstract class contains code that is common to classes such as Merge or Concatenate: essentially, command-line parsing, index opening, and term list fusion are taken care of. Then, the template method combine(int) must write into indexWriter the combined inverted list, returning the resulting frequency.
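The division of labour described above can be illustrated with a minimal, self-contained sketch. It is not MG4J code: ToyCombine, ToySum and the plain int arrays standing in for inverted lists are hypothetical names introduced only for illustration. The shared driver fills frequency and usedIndex for each term, and the subclass hook returns the resulting frequency.

```java
// Hypothetical miniature of the template-method structure; not the MG4J API.
abstract class ToyCombine {
    /** For each input, the frequency of the current term (0 if absent). */
    protected final int[] frequency;
    /** Partially filled with the inputs that contain the current term. */
    protected final int[] usedIndex;

    ToyCombine(int numIndices) {
        frequency = new int[numIndices];
        usedIndex = new int[numIndices];
    }

    /** The hook: combine the current term and return the resulting frequency. */
    protected abstract int combine(int numUsedIndices);

    /** The shared driver: termFrequencies[t][i] is the frequency of term t in input i. */
    final int[] run(int[][] termFrequencies) {
        int[] result = new int[termFrequencies.length];
        for (int t = 0; t < termFrequencies.length; t++) {
            int numUsedIndices = 0;
            for (int i = 0; i < frequency.length; i++) {
                frequency[i] = termFrequencies[t][i];
                if (frequency[i] > 0) usedIndex[numUsedIndices++] = i;
            }
            result[t] = combine(numUsedIndices);
        }
        return result;
    }
}

// A toy strategy that simply adds up the frequencies of the participating inputs.
final class ToySum extends ToyCombine {
    ToySum(int numIndices) { super(numIndices); }

    @Override
    protected int combine(int numUsedIndices) {
        int total = 0;
        for (int k = 0; k < numUsedIndices; k++) total += frequency[usedIndex[k]];
        return total;
    }

    public static void main(String[] args) {
        int[] r = new ToySum(2).run(new int[][] { { 3, 0 }, { 1, 2 } });
        System.out.println(r[0] + " " + r[1]);
    }
}
```

Summing frequencies is only one possible strategy, chosen here because it is easy to check by hand; the actual subclasses define their own semantics.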
Note that by combining a single index into a new one you can recompress an index with different compression parameters (which includes the possibility of eliminating positions or counts).
The subclasses of this class must implement combine(int) so that indices with different sets of features are combined keeping the largest set of features requested by the user. For instance, combining an index with positions and an index with counts, but no positions, should generate an index with counts but no positions.
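One way to read the rule above is as a set intersection: the output keeps every requested feature that all inputs can provide. The sketch below follows that reading. The Feature enum and the outputFeatures method are hypothetical and are not part of this class.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;

// Illustrative only: Feature and outputFeatures are assumed names, not MG4J API.
final class FeatureSketch {
    enum Feature { COUNTS, POSITIONS, PAYLOADS }

    /** Keeps the requested features that every input index provides. */
    static EnumSet<Feature> outputFeatures(EnumSet<Feature> requested, List<EnumSet<Feature>> inputs) {
        EnumSet<Feature> result = EnumSet.copyOf(requested);
        for (EnumSet<Feature> available : inputs) result.retainAll(available);
        return result;
    }

    public static void main(String[] args) {
        EnumSet<Feature> withPositions = EnumSet.of(Feature.COUNTS, Feature.POSITIONS);
        EnumSet<Feature> countsOnly = EnumSet.of(Feature.COUNTS);
        // The example from the text: positions are dropped, counts survive.
        System.out.println(outputFeatures(EnumSet.allOf(Feature.class), Arrays.asList(withPositions, countsOnly)));
    }
}
```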
Warning: a combination requires opening three files per input index, plus a few more files for the output index. If the combination process is interrupted by an exception claiming that there are too many open files, check how to increase the number of files you can open (usually, for instance on UN*X, there is a global and a per-process limit, so be sure to set both).
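As a rough aid for the warning above, the following sketch estimates how many files a combination keeps open and prints the per-process descriptor limit the JVM reports. The factor of three per input comes from the text. The number of output files is not stated on this page, so it is a caller-supplied guess, and the values used in main are purely illustrative. The limit query relies on the JDK-specific com.sun.management.UnixOperatingSystemMXBean and is skipped where that interface is unavailable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class OpenFileEstimate {
    /** Three files per input index (from the text) plus an assumed number of output files. */
    static long estimate(int numInputs, int outputFiles) {
        return 3L * numInputs + outputFiles;
    }

    public static void main(String[] args) {
        // 200 inputs and 7 output files are illustrative values, not documented figures.
        System.out.println("Estimated open files: " + estimate(200, 7));
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            long max = ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
            System.out.println("Per-process descriptor limit seen by the JVM: " + max);
        }
    }
}
```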
If the indices and bitstream index readers involved in the combination are read-once (i.e., opening an index and reading its contents sequentially once causes each file composing the index to be read exactly once), then Combine implementations should also be read-once (Concatenate, Merge and Paste are).
This means, in particular, that index combination can be performed from pipes, which in turn can be filled, for instance, with data coming from the network. In other words, although this class is nominally based on a number of indices existing on a local disk, those indices can be replaced with suitable pipes filled with remote data without affecting the combination process. For instance, the following bash code creates three sets of pipes:
for i in 0 1 2; do
  for e in frequencies globcounts index offsets properties sizes terms; do
    mkfifo pipe$i.$e
  done
done
Each pipe should then be filled with suitable data, for instance obtained from the net (assuming you have indices index0, index1 and index2 on example.com):
for i in 0 1 2; do
  for e in frequencies globcounts index offsets properties sizes terms; do
    (ssh -x example.com cat index$i.$e >pipe$i.$e &)
  done
done
Now all pipes will be filled with data from the corresponding remote files, and combining the indices pipe0, pipe1 and pipe2 will give the same result as combining index0, index1 and index2 on the remote system.
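Before starting a combination over pipes it can be handy to check that every expected pipe exists. The sketch below enumerates the same names that the bash loops above produce and reports the missing ones. PipeNames and filesFor are hypothetical helpers, and the extension list is simply the one used in the example, not an authoritative list of index files.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class PipeNames {
    // Extensions copied from the bash example above.
    static final String[] EXTENSIONS = {
        "frequencies", "globcounts", "index", "offsets", "properties", "sizes", "terms"
    };

    /** Returns prefix0.ext, prefix1.ext, ... for every extension, in loop order. */
    static List<String> filesFor(String prefix, int count) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++)
            for (String e : EXTENSIONS) names.add(prefix + i + "." + e);
        return names;
    }

    public static void main(String[] args) {
        for (String name : filesFor("pipe", 3))
            if (!Files.exists(Paths.get(name))) System.out.println("missing: " + name);
    }
}
```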
Nested Class Summary | |
---|---|
protected static class | Combine.GammaCodedIntIterator: A partial IntIterator implementation based on γ-coded integers. |
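For readers unfamiliar with γ coding, here is a minimal sketch of the textbook Elias γ code for positive integers, using strings of '0' and '1' instead of a bit stream. It does not reproduce Combine.GammaCodedIntIterator. In particular, whether the library shifts values so that zero is representable is not stated on this page, so the sketch assumes strictly positive input.

```java
import java.util.ArrayList;
import java.util.List;

// Textbook Elias gamma coding on bit strings; GammaSketch is an illustrative name.
final class GammaSketch {
    /** Writes (length - 1) zeros followed by the binary representation of x. */
    static String encode(int x) {
        if (x < 1) throw new IllegalArgumentException("x must be positive");
        String binary = Integer.toBinaryString(x);
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) sb.append('0');
        return sb.append(binary).toString();
    }

    /** Decodes a concatenation of gamma codes; assumes well-formed input. */
    static List<Integer> decodeAll(String bits) {
        List<Integer> values = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') { zeros++; pos++; }
            // The next zeros + 1 bits are the value itself, leading 1 included.
            values.add(Integer.parseInt(bits.substring(pos, pos + zeros + 1), 2));
            pos += zeros + 1;
        }
        return values;
    }

    public static void main(String[] args) {
        String stream = encode(1) + encode(2) + encode(5);
        System.out.println(stream + " -> " + decodeAll(stream));
    }
}
```

Small values get short codes (1 is a single bit), which is why γ coding suits lists dominated by small integers.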
Field Summary | |
---|---|
static int | DEFAULT_BUFFER_SIZE: The default buffer size. |
protected int[] | frequency: For each index, the frequency of the current term (given that it is present). |
protected boolean | hasCounts: Whether indexWriter has counts. |
protected boolean | hasPayloads: Whether indexWriter has payloads. |
protected boolean | hasPositions: Whether indexWriter has positions. |
protected BitStreamIndex[] | index: The array of indices to be merged. |
protected IndexIterator[] | indexIterator: An array of index iterators parallel to index (filled by concrete implementations). |
protected IndexReader[] | indexReader: An array of index readers parallel to index. |
protected IndexWriter | indexWriter: The index writer for the merged index. |
protected String[] | inputBasename: The array of input basenames. |
protected int | maxCount: The maximum count in the merged index. |
protected int | numberOfDocuments: The overall number of documents. |
protected long | numberOfOccurrences: The overall number of occurrences. |
protected int | numIndices: The number of indices to be merged. |
protected int[] | position: A cache for positions. |
protected int[] | size: The size of each document. |
protected ObjectHeapSemiIndirectPriorityQueue<MutableString> | termQueue: The queue containing terms. |
protected int[] | usedIndex: An array partially filled with the indices (as offsets in index) participating in the merge process for the current term. |
Constructor Summary | |
---|---|
Combine(String outputBasename, String[] inputBasename, boolean metadataOnly, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval) |
Method Summary | |
---|---|
protected abstract int | combine(int numUsedIndices): Combines several indices. |
protected abstract int | combineNumberOfDocuments(): Combines the number of documents. |
protected abstract int | combineSizes(): Combines size lists. |
protected BitStreamIndex | getIndex(CharSequence basename): Returns an index with the given basename, loaded with options suitable to perform the combination. |
static void | main(String[] arg) |
static void | main(String[] arg, Class<? extends Combine> combineClass) |
void | run() |
protected IntIterator | sizes(int numIndex): Returns an iterator on sizes. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_BUFFER_SIZE
protected final int numIndices
protected final BitStreamIndex[] index
protected final IndexReader[] indexReader
An array of index readers parallel to index.
protected final IndexIterator[] indexIterator
An array of index iterators parallel to index (filled by concrete implementations).
protected ObjectHeapSemiIndirectPriorityQueue<MutableString> termQueue
protected final int numberOfDocuments
protected long numberOfOccurrences
protected int maxCount
protected final String[] inputBasename
protected IndexWriter indexWriter
protected final boolean hasCounts
Whether indexWriter has counts.
protected final boolean hasPositions
Whether indexWriter has positions.
protected final boolean hasPayloads
Whether indexWriter has payloads.
protected int[] usedIndex
An array partially filled with the indices (as offsets in index) participating in the merge process for the current term.
protected final int[] frequency
protected int[] position
protected int[] size
Constructor Detail |
---|
public Combine(String outputBasename, String[] inputBasename, boolean metadataOnly, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval) throws IOException, ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Throws: IOException, ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Method Detail |
---|
protected BitStreamIndex getIndex(CharSequence basename) throws ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
This basic implementation calls Index.getInstance(CharSequence, boolean, boolean) with all Boolean parameters set to false. Subclasses can override this method to load more data.
Parameters: basename - an index basename.
Throws: ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
protected abstract int combineNumberOfDocuments()
protected IntIterator sizes(int numIndex) throws FileNotFoundException
The purpose of this method is to provide combineSizes() implementations with a way to access the size list from a disk file or from Index.sizes transparently. This mechanism is essential to ensure that size files are read exactly once.
The caller should check whether the returned object implements Closeable and, in this case, invoke Closeable.close() after usage.
Parameters: numIndex - the number of an index.
Throws: FileNotFoundException
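The Closeable check asked of callers can be sketched as follows. The example is self-contained and uses java.util.Iterator in place of IntIterator. FileBackedSizes is a hypothetical stand-in for an iterator that owns a file, and sumAndClose is an illustrative caller, not a method of this class.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Iterator;

final class CloseableCheck {
    /** Hypothetical iterator that also holds a resource needing release. */
    static final class FileBackedSizes implements Iterator<Integer>, Closeable {
        private final Iterator<Integer> delegate = Arrays.asList(3, 5, 8).iterator();
        boolean closed = false;

        @Override public boolean hasNext() { return delegate.hasNext(); }
        @Override public Integer next() { return delegate.next(); }
        @Override public void close() { closed = true; }
    }

    /** Consumes the iterator once, then closes it if (and only if) it is Closeable. */
    static long sumAndClose(Iterator<Integer> sizes) {
        long total = 0;
        try {
            while (sizes.hasNext()) total += sizes.next();
        } finally {
            if (sizes instanceof Closeable) {
                try {
                    ((Closeable) sizes).close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        FileBackedSizes fromFile = new FileBackedSizes();
        System.out.println(sumAndClose(fromFile) + " closed=" + fromFile.closed);
        // An in-memory iterator is not Closeable, so nothing is closed here.
        System.out.println(sumAndClose(Arrays.asList(1, 2).iterator()));
    }
}
```

Putting the check in a finally block keeps the read-once guarantee from leaking a file handle when iteration fails midway.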
protected abstract int combineSizes() throws IOException
Throws: IOException
protected abstract int combine(int numUsedIndices) throws IOException
When this method is called, exactly numUsedIndices entries of usedIndex contain, in increasing order, the indices containing inverted lists for the current term. Implementations of this method must combine the inverted lists, save the total global count for the current term and return the resulting frequency.
Parameters: numUsedIndices - the number of valid entries in usedIndex.
Throws: IOException
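The contract above (usedIndex filled in increasing order for the current term) can be illustrated with a small k-way merge over sorted term lists. This sketch swaps in java.util.PriorityQueue for the ObjectHeapSemiIndirectPriorityQueue used by termQueue, and plain string arrays for term files. TermFusionSketch and fuse are hypothetical names, and the point where a real implementation would call combine(numUsedIndices) is marked with a comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class TermFusionSketch {
    /** For each distinct term, in sorted order, reports which inputs contain it. */
    static List<String> fuse(String[][] sortedTerms) {
        int n = sortedTerms.length;
        int[] cursor = new int[n];    // position of the current term in each input
        int[] usedIndex = new int[n]; // partially filled, as described above
        // Input numbers ordered by their current term; ties broken by input number,
        // so equal terms are dequeued in increasing input order.
        PriorityQueue<Integer> queue = new PriorityQueue<>((a, b) -> {
            int c = sortedTerms[a][cursor[a]].compareTo(sortedTerms[b][cursor[b]]);
            return c != 0 ? c : Integer.compare(a, b);
        });
        for (int i = 0; i < n; i++) if (sortedTerms[i].length > 0) queue.add(i);

        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            int first = queue.peek();
            String term = sortedTerms[first][cursor[first]];
            int numUsedIndices = 0;
            // Dequeue every input whose current term equals the smallest one.
            while (!queue.isEmpty() && sortedTerms[queue.peek()][cursor[queue.peek()]].equals(term))
                usedIndex[numUsedIndices++] = queue.poll();
            // A real implementation would call combine(numUsedIndices) here.
            out.add(term + "=" + Arrays.toString(Arrays.copyOf(usedIndex, numUsedIndices)));
            // Advance only the dequeued inputs, then re-enqueue those with terms left.
            for (int k = 0; k < numUsedIndices; k++) {
                int i = usedIndex[k];
                if (++cursor[i] < sortedTerms[i].length) queue.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fuse(new String[][] { { "a", "c" }, { "b", "c" }, { "a" } }));
        // prints [a=[0, 2], b=[1], c=[0, 1]]
    }
}
```

Cursors are changed only for inputs that are currently outside the queue, so the comparator never sees a key change underneath an enqueued element.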
public void run() throws ConfigurationException, IOException
Throws: ConfigurationException, IOException
public static void main(String[] arg) throws com.martiansoftware.jsap.JSAPException, ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Throws: com.martiansoftware.jsap.JSAPException, ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
public static void main(String[] arg, Class<? extends Combine> combineClass) throws com.martiansoftware.jsap.JSAPException, ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Throws: com.martiansoftware.jsap.JSAPException, ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException