it.unimi.dsi.mg4j.tool
Class Paste

java.lang.Object
  extended by it.unimi.dsi.mg4j.tool.Combine
      extended by it.unimi.dsi.mg4j.tool.Paste

public final class Paste
extends Combine

Pastes several indices.

Pasting is a very slow way of combining indices: we assume that not only documents, but also the occurrences within a single document, might be scattered throughout several indices. When a document appears in several indices, its occurrences in a given index are renumbered starting from the sum of the sizes of that document in the previous indices.
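
The renumbering rule can be illustrated with a minimal sketch. It is not MG4J code: the class name, method name and the plain-array representation are assumptions made for illustration only. For one term and one document, each input index contributes its positions shifted by the total size of the document in the preceding indices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating the renumbering rule; not part of MG4J. */
final class PasteRenumberingSketch {
    /**
     * Pastes the positions of one term inside one document.
     *
     * @param positions positions[i] holds the term's positions in the document as seen by input index i
     *                  (an empty array if the term does not occur there).
     * @param sizes sizes[i] is the size of the document in input index i.
     * @return the positions renumbered as in the pasted index.
     */
    static int[] pastePositions(int[][] positions, int[] sizes) {
        List<Integer> result = new ArrayList<>();
        int offset = 0; // Sum of the sizes of the document in the previous indices
        for (int i = 0; i < positions.length; i++) {
            for (int p : positions[i]) result.add(p + offset);
            offset += sizes[i];
        }
        return result.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For example, with sizes 3, 4 and 2 in three indices, a position 1 in the third index becomes position 8 in the pasted document.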

Conceptually, this operation is equivalent to splitting a collection vertically: each document is divided into a fixed number n of consecutive segments (possibly of length 0), and a set of n indices is created using the k-th segment of all documents. Pasting the resulting indices will produce an index that is identical to the index generated by the original collection. The behaviour is analogous to that of the UN*X paste command if documents are single-line lists of words.
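
The equivalence with vertical splitting can be checked on a toy example. The sketch below is an assumption-laden illustration, not the MG4J implementation: it indexes a single document held in memory as a word array, the class and method names are hypothetical, and the segment boundaries are passed explicitly as the `cuts` array (nondecreasing offsets, so zero-length segments are allowed).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy model of vertical splitting followed by pasting; not part of MG4J. */
final class VerticalSplitSketch {
    /** Maps each term of a single document to the sorted list of its positions. */
    static Map<String, List<Integer>> index(String[] words) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int p = 0; p < words.length; p++)
            postings.computeIfAbsent(words[p], t -> new ArrayList<>()).add(p);
        return postings;
    }

    /**
     * Splits the document at the given boundaries, indexes each segment on its own, and pastes
     * the segment indices by shifting positions by the total length of the preceding segments.
     */
    static Map<String, List<Integer>> splitAndPaste(String[] words, int[] cuts) {
        Map<String, List<Integer>> pasted = new TreeMap<>();
        int start = 0;
        for (int k = 0; k <= cuts.length; k++) {
            int end = k < cuts.length ? cuts[k] : words.length;
            final int offset = start; // Sum of the sizes of the previous segments
            index(Arrays.copyOfRange(words, start, end)).forEach((term, positions) -> {
                List<Integer> target = pasted.computeIfAbsent(term, t -> new ArrayList<>());
                for (int p : positions) target.add(p + offset);
            });
            start = end;
        }
        return pasted;
    }
}
```

Under these assumptions, `splitAndPaste(words, cuts)` returns the same map as `index(words)` for any valid choice of `cuts`.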

In practice, pasting is usually applied to indices obtained from a virtual field (e.g., indices containing anchor text fragments).

Note that if every document appears in at most one index, pasting is equivalent to merging. It is, however, significantly slower: since the same document may appear in several lists, the inverted lists to be pasted must be scanned completely to compute the frequency.
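
The following sketch shows why the frequency cannot be obtained without a scan. It is an illustration under stated assumptions, not MG4J code: inverted lists are modelled as strictly increasing arrays of document pointers, the names are hypothetical, and the minimum pointer is found by a linear pass rather than with a priority queue.

```java
/** Hypothetical sketch of frequency computation when pasting; not part of MG4J. */
final class PasteFrequencySketch {
    /**
     * Counts the distinct document pointers across all lists.
     *
     * @param lists lists[i] is the strictly increasing list of document pointers for one term in index i.
     */
    static int frequency(int[][] lists) {
        int[] cursor = new int[lists.length];
        int frequency = 0;
        for (;;) {
            // Find the smallest document pointer not yet consumed
            int min = Integer.MAX_VALUE;
            boolean any = false;
            for (int i = 0; i < lists.length; i++)
                if (cursor[i] < lists[i].length) {
                    any = true;
                    min = Math.min(min, lists[i][cursor[i]]);
                }
            if (!any) return frequency;
            frequency++; // One document, however many lists contain it
            for (int i = 0; i < lists.length; i++)
                if (cursor[i] < lists[i].length && lists[i][cursor[i]] == min) cursor[i]++;
        }
    }

    /** Shortcut that is valid only when no document appears in more than one list. */
    static int frequencyIfDisjoint(int[][] lists) {
        int sum = 0;
        for (int[] list : lists) sum += list.length;
        return sum;
    }
}
```

When the lists are disjoint the two methods agree and the sum of the list lengths is enough; as soon as a document is shared, only the full scan gives the right answer.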

Since:
1.0
Author:
Sebastiano Vigna

Nested Class Summary
 
Nested classes/interfaces inherited from class it.unimi.dsi.mg4j.tool.Combine
Combine.GammaCodedIntIterator
 
Field Summary
static int DEFAULT_MEMORY_BUFFER_SIZE
          The default size of the temporary bit stream buffer used while pasting.
protected  int[] doc
          The reference array of the document queue.
protected  IntHeapPriorityQueue documentQueue
          The queue containing document pointers (for remapped indices).
 
Fields inherited from class it.unimi.dsi.mg4j.tool.Combine
DEFAULT_BUFFER_SIZE, frequency, hasCounts, hasPayloads, hasPositions, index, indexIterator, indexReader, indexWriter, inputBasename, maxCount, numberOfDocuments, numberOfOccurrences, numIndices, position, size, termQueue, usedIndex
 
Constructor Summary
Paste(String outputBasename, String[] inputBasename, boolean metadataOnly, int bufferSize, File tempFileDir, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval)
           
 
Method Summary
protected  int combine(int numUsedIndices)
          Combines several indices.
protected  int combineNumberOfDocuments()
          Combines the number of documents.
protected  int combineSizes()
          Combines size lists.
protected  BitStreamIndex getIndex(CharSequence basename)
          Returns an index with the given basename, loading document sizes.
static void main(String[] arg)
           
 void run()
           
 
Methods inherited from class it.unimi.dsi.mg4j.tool.Combine
main, sizes
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_MEMORY_BUFFER_SIZE

public static final int DEFAULT_MEMORY_BUFFER_SIZE
The default size of the temporary bit stream buffer used while pasting. Posting lists larger than this size will be precomputed on disk and then added to the index.

See Also:
Constant Field Values

doc

protected int[] doc
The reference array of the document queue.


documentQueue

protected IntHeapPriorityQueue documentQueue
The queue containing document pointers (for remapped indices).

Constructor Detail

Paste

public Paste(String outputBasename,
             String[] inputBasename,
             boolean metadataOnly,
             int bufferSize,
             File tempFileDir,
             int tempBufferSize,
             Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags,
             boolean interleaved,
             boolean skips,
             int quantum,
             int height,
             int skipBufferSize,
             long logInterval)
      throws IOException,
             ConfigurationException,
             URISyntaxException,
             ClassNotFoundException,
             SecurityException,
             InstantiationException,
             IllegalAccessException,
             InvocationTargetException,
             NoSuchMethodException
Throws:
IOException
ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
Method Detail

getIndex

protected BitStreamIndex getIndex(CharSequence basename)
                           throws ConfigurationException,
                                  IOException,
                                  URISyntaxException,
                                  ClassNotFoundException,
                                  SecurityException,
                                  InstantiationException,
                                  IllegalAccessException,
                                  InvocationTargetException,
                                  NoSuchMethodException
Returns an index with the given basename, loading document sizes.

Overrides:
getIndex in class Combine
Parameters:
basename - an index basename.
Returns:
an index loaded with document sizes.
Throws:
ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException

combineNumberOfDocuments

protected int combineNumberOfDocuments()
Description copied from class: Combine
Combines the number of documents.

Specified by:
combineNumberOfDocuments in class Combine
Returns:
the number of documents of the combined index.

combineSizes

protected int combineSizes()
                    throws IOException
Description copied from class: Combine
Combines size lists.

Specified by:
combineSizes in class Combine
Returns:
the maximum size of a document in the combined index.
Throws:
IOException
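
As a rough illustration of what combining size lists means for pasting, the sketch below assumes that the size of a document in the pasted index is the sum of its sizes in the input indices, which is what the renumbering rule in the class description implies. The class, the method signature and the in-memory arrays are hypothetical; the actual method works on the size files of the indices.

```java
/** Hypothetical in-memory model of size combination for pasting; not part of MG4J. */
final class PasteSizesSketch {
    /**
     * Fills combined with the per-document sum of sizes and returns the largest sum.
     *
     * @param sizes sizes[i][d] is the size of document d in input index i (assumed 0 if the document is absent).
     * @param combined the array to fill, one entry per document of the combined index.
     * @return the maximum size of a document in the combined index.
     */
    static int combineSizes(int[][] sizes, int[] combined) {
        int max = 0;
        for (int d = 0; d < combined.length; d++) {
            int sum = 0;
            // An index that knows fewer documents simply contributes nothing
            for (int[] s : sizes) if (d < s.length) sum += s[d];
            combined[d] = sum;
            max = Math.max(max, sum);
        }
        return max;
    }
}
```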

combine

protected int combine(int numUsedIndices)
               throws IOException
Description copied from class: Combine
Combines several indices.

When this method is called, exactly numUsedIndices entries of Combine.usedIndex contain, in increasing order, the indices containing inverted lists for the current term. Implementations of this method must combine the inverted lists, save the total global count for the current term and return the resulting frequency.

Specified by:
combine in class Combine
Parameters:
numUsedIndices - the number of valid entries in Combine.usedIndex.
Returns:
the frequency of the combined lists.
Throws:
IOException

run

public void run()
         throws ConfigurationException,
                IOException
Overrides:
run in class Combine
Throws:
ConfigurationException
IOException

main

public static void main(String[] arg)
                 throws ConfigurationException,
                        SecurityException,
                        com.martiansoftware.jsap.JSAPException,
                        IOException,
                        URISyntaxException,
                        ClassNotFoundException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException
Throws:
ConfigurationException
SecurityException
com.martiansoftware.jsap.JSAPException
IOException
URISyntaxException
ClassNotFoundException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException