java.lang.Object
  it.unimi.dsi.mg4j.tool.Combine
      it.unimi.dsi.mg4j.tool.Paste
public final class Paste
Pastes several indices.
Pasting is a very slow way of combining indices: we assume that not only documents, but also document occurrences might be scattered throughout several indices. When a document appears in several indices, its occurrences in a given index are combined by renumbering them starting from the sum of the sizes for the document in the previous indices.
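The renumbering rule can be illustrated with a small, self-contained sketch. It does not use the MG4J API: the class `PasteRenumbering`, the method `pastePositions`, and the array-based representation of sizes and positions are assumptions introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not part of MG4J. */
final class PasteRenumbering {
    /**
     * Combines the positions of one term in one document across several indices.
     *
     * @param sizes sizes[i] is the size of the document in index i (0 if absent).
     * @param positions positions[i] holds the term's positions in index i (empty if none).
     * @return the positions renumbered into the pasted document.
     */
    static List<Integer> pastePositions(int[] sizes, int[][] positions) {
        List<Integer> result = new ArrayList<>();
        int offset = 0; // sum of the sizes of the document in the previous indices
        for (int i = 0; i < sizes.length; i++) {
            for (int p : positions[i]) result.add(offset + p);
            offset += sizes[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] sizes = { 3, 0, 4 };
        int[][] positions = { { 0, 2 }, {}, { 1, 3 } };
        System.out.println(pastePositions(sizes, positions)); // [0, 2, 4, 6]
    }
}
```

In this example the document has size 3 in the first index and is absent from the second, so the occurrences found in the third index are shifted by 3.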
Conceptually, this operation is equivalent to splitting a collection vertically: each document is divided into a fixed number n of consecutive segments (possibly of length 0), and a set of n indices is created using the k-th segment of all documents. Pasting the resulting indices will produce an index that is identical to the index generated by the original collection. The behaviour is analogous to that of the UN*X paste command if documents are single-line lists of words.
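The vertical-splitting view can be sketched on a single document represented as a list of words. The class `VerticalSplit` and its methods are hypothetical and only model the idea; they assume that the segment lengths sum to the document length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not part of MG4J. */
final class VerticalSplit {
    /** Splits a document into consecutive segments of the given lengths (some may be 0). */
    static List<List<String>> split(List<String> document, int[] segmentLength) {
        List<List<String>> segments = new ArrayList<>();
        int start = 0;
        for (int length : segmentLength) {
            segments.add(new ArrayList<>(document.subList(start, start + length)));
            start += length;
        }
        return segments;
    }

    /** Pastes the segments back together, in order. */
    static List<String> paste(List<List<String>> segments) {
        List<String> document = new ArrayList<>();
        for (List<String> segment : segments) document.addAll(segment);
        return document;
    }

    public static void main(String[] args) {
        List<String> document = Arrays.asList("a", "b", "c", "d", "e");
        List<List<String>> segments = split(document, new int[] { 2, 0, 3 });
        System.out.println(segments);        // [[a, b], [], [c, d, e]]
        System.out.println(paste(segments)); // [a, b, c, d, e]
    }
}
```

Indexing the k-th segment of every document separately and then pasting the resulting indices corresponds, in this model, to calling `split` and then `paste`.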
In practice, pasting is usually applied to indices obtained from a virtual field (e.g., indices containing anchor text fragments).
Note that if every document appears in at most one index, pasting is equivalent to merging. It is, however, significantly slower: since the same document may appear in several lists, the inverted lists to be pasted must be scanned completely to compute the frequency.
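The cost difference can be seen in a small sketch, which is not MG4J code. The class `PasteFrequency` and its two methods are hypothetical, and inverted lists are modelled as plain arrays of document pointers. When lists are disjoint the frequency is the sum of the list lengths, which can be known without reading the lists; when documents are shared, every pointer has to be examined.

```java
import java.util.TreeSet;

/** Illustrative sketch only; not part of MG4J. */
final class PasteFrequency {
    /** Correct only if no document appears in more than one list (the merge case). */
    static int sumOfLengths(int[][] lists) {
        int frequency = 0;
        for (int[] list : lists) frequency += list.length;
        return frequency;
    }

    /** Correct in general (the paste case), but requires scanning every list. */
    static int distinctDocuments(int[][] lists) {
        TreeSet<Integer> documents = new TreeSet<>();
        for (int[] list : lists) for (int d : list) documents.add(d);
        return documents.size();
    }

    public static void main(String[] args) {
        int[][] disjoint = { { 0, 3 }, { 4 }, { 5 } };
        int[][] overlapping = { { 0, 3, 5 }, { 3, 4 }, { 5 } };
        System.out.println(sumOfLengths(disjoint));         // 4
        System.out.println(distinctDocuments(disjoint));    // 4
        System.out.println(sumOfLengths(overlapping));      // 6 (overcounts)
        System.out.println(distinctDocuments(overlapping)); // 4
    }
}
```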
Nested Class Summary |
---|
Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.tool.Combine |
---|
Combine.GammaCodedIntIterator |
Field Summary | |
---|---|
static int | DEFAULT_MEMORY_BUFFER_SIZE - The default size of the temporary bit stream buffer used while pasting. |
protected int[] | doc - The reference array of the document queue. |
protected IntHeapPriorityQueue | documentQueue - The queue containing document pointers (for remapped indices). |
Fields inherited from class it.unimi.dsi.mg4j.tool.Combine |
---|
DEFAULT_BUFFER_SIZE, frequency, hasCounts, hasPayloads, hasPositions, index, indexIterator, indexReader, indexWriter, inputBasename, maxCount, numberOfDocuments, numberOfOccurrences, numIndices, position, size, termQueue, usedIndex |
Constructor Summary | |
---|---|
Paste(String outputBasename, String[] inputBasename, boolean metadataOnly, int bufferSize, File tempFileDir, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval) |
Method Summary | |
---|---|
protected int | combine(int numUsedIndices) - Combines several indices. |
protected int | combineNumberOfDocuments() - Combines the number of documents. |
protected int | combineSizes() - Combines size lists. |
protected BitStreamIndex | getIndex(CharSequence basename) - Returns an index with given basename, loading document sizes. |
static void | main(String[] arg) |
void | run() |
Methods inherited from class it.unimi.dsi.mg4j.tool.Combine |
---|
main, sizes |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_MEMORY_BUFFER_SIZE
protected int[] doc
protected IntHeapPriorityQueue documentQueue
Constructor Detail |
---|
public Paste(String outputBasename, String[] inputBasename, boolean metadataOnly, int bufferSize, File tempFileDir, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval) throws IOException, ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Throws:
IOException
ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
Method Detail |
---|
protected BitStreamIndex getIndex(CharSequence basename) throws ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
getIndex in class Combine
Parameters:
basename - an index basename.
Throws:
ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
protected int combineNumberOfDocuments()
combineNumberOfDocuments in class Combine
protected int combineSizes() throws IOException
combineSizes in class Combine
Throws:
IOException
protected int combine(int numUsedIndices) throws IOException
Description copied from class: Combine
When this method is called, exactly numUsedIndices entries of Combine.usedIndex contain, in increasing order, the indices containing inverted lists for the current term. Implementations of this method must combine the inverted lists, save the total global count for the current term, and return the resulting frequency.
combine in class Combine
Parameters:
numUsedIndices - the number of valid entries in Combine.usedIndex.
Throws:
IOException
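As an illustration of how a queue of document pointers backed by a reference array can drive the scan of several inverted lists, here is a minimal sketch. It is not the actual implementation of this method: it uses `java.util.PriorityQueue` instead of the `IntHeapPriorityQueue` named in the field summary, models inverted lists as strictly increasing arrays of document pointers, and only computes the frequency and which lists contain each document. The class `DocumentQueueSketch` and the parameter names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only; not part of MG4J. */
final class DocumentQueueSketch {
    /**
     * Scans the document pointers of several inverted lists in parallel.
     *
     * @param pointers pointers[i] is the strictly increasing pointer list of the i-th used index.
     * @param out receives one entry per distinct document: the document, then the lists containing it.
     * @return the frequency, that is, the number of distinct documents.
     */
    static int combine(int[][] pointers, List<List<Integer>> out) {
        int n = pointers.length;
        int[] doc = new int[n];  // reference array: current document of each list
        int[] next = new int[n]; // next unread entry of each list
        // Queue of list numbers, ordered by current document, then by list number.
        PriorityQueue<Integer> queue = new PriorityQueue<>(
            (a, b) -> doc[a] != doc[b] ? Integer.compare(doc[a], doc[b]) : Integer.compare(a, b));
        for (int i = 0; i < n; i++)
            if (pointers[i].length > 0) { doc[i] = pointers[i][next[i]++]; queue.add(i); }

        int frequency = 0;
        while (!queue.isEmpty()) {
            int current = doc[queue.peek()];
            List<Integer> entry = new ArrayList<>();
            entry.add(current);
            // Dequeue every list currently positioned on this document, in increasing list order.
            while (!queue.isEmpty() && doc[queue.peek()] == current) {
                int i = queue.poll();
                entry.add(i);
                // doc[i] is updated only while i is outside the queue, so the heap stays consistent.
                if (next[i] < pointers[i].length) { doc[i] = pointers[i][next[i]++]; queue.add(i); }
            }
            out.add(entry);
            frequency++;
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<List<Integer>> out = new ArrayList<>();
        int frequency = combine(new int[][] { { 0, 3, 5 }, { 3, 4 }, { 5 } }, out);
        System.out.println(frequency); // 4
        System.out.println(out);       // [[0, 0], [3, 0, 1], [4, 1], [5, 0, 2]]
    }
}
```

Breaking ties by list number means that the lists contributing to a document are visited in increasing index order, which is the order in which occurrences are renumbered when pasting.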
public void run() throws ConfigurationException, IOException
run in class Combine
Throws:
ConfigurationException
IOException
public static void main(String[] arg) throws ConfigurationException, SecurityException, com.martiansoftware.jsap.JSAPException, IOException, URISyntaxException, ClassNotFoundException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Throws:
ConfigurationException
SecurityException
com.martiansoftware.jsap.JSAPException
IOException
URISyntaxException
ClassNotFoundException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException