it.unimi.dsi.mg4j.tool
Class Combine

java.lang.Object
  extended by it.unimi.dsi.mg4j.tool.Combine
Direct Known Subclasses:
Concatenate, Merge, Paste

public abstract class Combine
extends Object

Combines several indices.

Indices may be combined in several different ways. This abstract class contains code that is common to classes such as Merge or Concatenate: essentially, command line parsing, index opening, and term list fusion are taken care of. Then, the template method combine(int) must write into indexWriter the combined inverted list, returning the resulting frequency. If, however, metadataOnly is true, indexWriter is null and combine(int) must just return the resulting frequency.

Note that by combining a single index into a new one you can recompress an index with different compression parameters (which includes the possibility of eliminating positions or counts). It is also possible to build just the metadata associated with an index (term list, frequencies, global counts).

The subclasses of this class must implement combine(int) so that indices with different sets of features are combined keeping the largest set of features requested by the user. For instance, combining an index with positions and an index with counts, but no positions, should generate an index with counts but no positions.

Warning: a combination requires opening three files per input index, plus a few more files for the output index. If the combination process is interrupted by an exception claiming that there are too many open files, check how to increase the number of files you can open (on UN*X, for instance, there are usually both a global and a per-process limit, so be sure to raise both).
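As a rough illustration of the warning above, the following sketch compares the number of input files implied by the three-files-per-index figure with the per-process descriptor limit reported by the JVM. The class name, the method names and the sample number of indices are hypothetical; the limit is read through com.sun.management.UnixOperatingSystemMXBean, which is only available on UN*X-like platforms and on JVMs that expose it, so the sketch falls back to -1 elsewhere.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical helper, not part of MG4J: a back-of-the-envelope open-file check. */
class OpenFilesSketch {
    /** Files opened for the inputs, using the three-per-index figure quoted above. */
    static long inputFilesNeeded(int numIndices) {
        return 3L * numIndices;
    }

    /** Per-process descriptor limit, or -1 if this JVM/platform does not expose it. */
    static long perProcessLimit() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        int numIndices = 400; // hypothetical number of input indices
        long needed = inputFilesNeeded(numIndices);
        long limit = perProcessLimit();
        if (limit < 0) System.out.println("Limit not available; inputs alone need " + needed + " files.");
        else if (needed >= limit) System.out.println("Limit " + limit + " is too low for " + needed + " input files.");
        else System.out.println("Limit " + limit + " leaves room for " + needed + " input files.");
    }
}
```

The estimate is deliberately conservative in one direction only: the output index and the JVM itself also hold open files, so some headroom beyond the computed figure is needed.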

Read-once indices, readers, and distributed index combination

If the indices and bitstream index readers involved in the combination are read-once (i.e., opening an index and reading its contents sequentially once causes each file composing the index to be read exactly once), then Combine implementations should be read-once as well (Concatenate, Merge and Paste are).

This means, in particular, that index combination can be performed from pipes, which in turn can be filled, for instance, with data coming from the network. In other words, although this class nominally works on a number of indices existing on a local disk, those indices can be replaced by suitable pipes filled with remote data without affecting the combination process. For instance, the following bash code creates three sets of pipes:

 for i in 0 1 2; do
   for e in frequencies globcounts index offsets posnumbits properties sizes terms; do 
     mkfifo pipe$i.$e
   done
 done
 

Each pipe should then be filled with suitable data, for instance obtained from the net (assuming you have indices index0, index1 and index2 on example.com):

 for i in 0 1 2; do 
   for e in frequencies globcounts index offsets posnumbits properties sizes terms; do 
     (ssh -x example.com cat index$i.$e >pipe$i.$e &)
   done
 done
 

Now all pipes will be filled with data from the corresponding remote files, and combining the indices pipe0, pipe1 and pipe2 will give the same result as combining index0, index1 and index2 on the remote system.

Since:
1.0
Author:
Sebastiano Vigna

Nested Class Summary
protected static class Combine.GammaCodedIntIterator
          A partial IntIterator implementation based on γ-coded integers.
 
Field Summary
protected  Properties additionalProperties
          Additional properties for the merged index.
protected  int bufferSize
          The size of I/O buffers.
static int DEFAULT_BUFFER_SIZE
          The default buffer size.
protected  int[] frequency
          For each index, the frequency of the current term (given that it is present).
protected  boolean hasCounts
          Whether indexWriter has counts.
protected  boolean hasPayloads
          Whether indexWriter has payloads.
protected  boolean hasPositions
          Whether indexWriter has positions.
protected  BitStreamIndex[] index
          The array of indices to be merged.
protected  IndexIterator[] indexIterator
          An array of index iterators parallel to index (filled by concrete implementations).
protected  IndexReader[] indexReader
          An array of index readers parallel to index.
protected  IndexWriter indexWriter
          The index writer for the merged index.
protected  String[] inputBasename
          The array of input basenames.
protected  int maxCount
          The maximum count in the merged index.
protected  boolean metadataOnly
          Compute only index metadata (sizes, terms and globcounts).
protected  boolean needsSizes
          True if the index writer needs sizes (usually, because it uses Golomb coding for its positions).
protected  int numberOfDocuments
          The overall number of documents.
protected  long numberOfOccurrences
          The overall number of occurrences.
protected  int numIndices
          The number of indices to be merged.
protected  String outputBasename
          The output basename.
protected  double p
          If nonzero, the fraction of space to be used by variable-quantum skip towers.
protected  int[] position
          A cache for positions.
protected  long predictedLengthNumBits
          The predicted number of bits for the positions of the next inverted list to be combined.
protected  long predictedSize
          The predicted size of the non-positional part of the next inverted list to be combined.
protected  int[] size
          The array of sizes of the combined index.
protected  ObjectHeapSemiIndirectPriorityQueue<MutableString> termQueue
          The queue containing terms.
protected  int[] usedIndex
          An array partially filled with the indices (as offsets in index) participating in the merge process for the current term.
protected  VariableQuantumIndexWriter variableQuantumIndexWriter
          A copy of indexWriter which is non-null if indexWriter is an instance of VariableQuantumIndexWriter.
 
Constructor Summary
Combine(String outputBasename, String[] inputBasename, boolean metadataOnly, boolean requireSizes, int bufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, boolean interleaved, boolean skips, int quantum, int height, int skipBufferSize, long logInterval)
          Combines several indices into one.
 
Method Summary
protected abstract  int combine(int numUsedIndices)
          Combines several indices.
protected abstract  int combineNumberOfDocuments()
          Combines the number of documents.
protected abstract  int combineSizes(OutputBitStream sizeOutputBitStream)
          Combines size lists.
static void main(String[] arg)
           
static void main(String[] arg, Class<? extends Combine> combineClass)
           
 void run()
           
protected  IntIterator sizes(int numIndex)
          Returns an iterator on sizes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_BUFFER_SIZE

public static final int DEFAULT_BUFFER_SIZE
The default buffer size.

See Also:
Constant Field Values

numIndices

protected final int numIndices
The number of indices to be merged.


index

protected final BitStreamIndex[] index
The array of indices to be merged.


indexReader

protected final IndexReader[] indexReader
An array of index readers parallel to index.


indexIterator

protected final IndexIterator[] indexIterator
An array of index iterators parallel to index (filled by concrete implementations).


metadataOnly

protected final boolean metadataOnly
Compute only index metadata (sizes, terms and globcounts).


termQueue

protected ObjectHeapSemiIndirectPriorityQueue<MutableString> termQueue
The queue containing terms.


numberOfDocuments

protected final int numberOfDocuments
The overall number of documents.


numberOfOccurrences

protected long numberOfOccurrences
The overall number of occurrences.


maxCount

protected int maxCount
The maximum count in the merged index.


inputBasename

protected final String[] inputBasename
The array of input basenames.


outputBasename

protected final String outputBasename
The output basename.


bufferSize

protected final int bufferSize
The size of I/O buffers.


p

protected final double p
If nonzero, the fraction of space to be used by variable-quantum skip towers.


indexWriter

protected IndexWriter indexWriter
The index writer for the merged index.


variableQuantumIndexWriter

protected VariableQuantumIndexWriter variableQuantumIndexWriter
A copy of indexWriter which is non-null if indexWriter is an instance of VariableQuantumIndexWriter.


hasCounts

protected final boolean hasCounts
Whether indexWriter has counts.


hasPositions

protected final boolean hasPositions
Whether indexWriter has positions.


hasPayloads

protected final boolean hasPayloads
Whether indexWriter has payloads.


additionalProperties

protected final Properties additionalProperties
Additional properties for the merged index.


usedIndex

protected final int[] usedIndex
An array partially filled with the indices (as offsets in index) participating in the merge process for the current term.


frequency

protected final int[] frequency
For each index, the frequency of the current term (given that it is present).


position

protected int[] position
A cache for positions.


needsSizes

protected final boolean needsSizes
True if the index writer needs sizes (usually, because it uses Golomb coding for its positions).


size

protected int[] size
The array of sizes of the combined index. This is set up by combineSizes(OutputBitStream) in the combiners that need it.


predictedSize

protected long predictedSize
The predicted size of the non-positional part of the next inverted list to be combined. It will be -1 unless p is nonzero.


predictedLengthNumBits

protected long predictedLengthNumBits
The predicted number of bits for the positions of the next inverted list to be combined. It will be -1 unless p is nonzero.

Constructor Detail

Combine

public Combine(String outputBasename,
               String[] inputBasename,
               boolean metadataOnly,
               boolean requireSizes,
               int bufferSize,
               Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags,
               boolean interleaved,
               boolean skips,
               int quantum,
               int height,
               int skipBufferSize,
               long logInterval)
        throws IOException,
               ConfigurationException,
               URISyntaxException,
               ClassNotFoundException,
               SecurityException,
               InstantiationException,
               IllegalAccessException,
               InvocationTargetException,
               NoSuchMethodException
Combines several indices into one.

Parameters:
outputBasename - the basename of the combined index.
inputBasename - the basenames of the input indices.
metadataOnly - if true, we save only metadata (term list, frequencies, global counts).
requireSizes - if true, the sizes of input indices will be forced to be loaded.
bufferSize - the buffer size for index readers.
writerFlags - the flags for the index writer.
interleaved - forces an interleaved index.
skips - whether to insert skips in case interleaved is true.
quantum - the quantum of skipping structures; if negative, a percentage of space for variable-quantum indices (irrelevant if skips is false).
height - the height of skipping towers (irrelevant if skips is false).
skipBufferSize - the size of the buffer used to temporarily hold inverted lists during the skipping structure construction.
logInterval - how often we log.
Throws:
IOException
ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
Method Detail

combineNumberOfDocuments

protected abstract int combineNumberOfDocuments()
Combines the number of documents.

Returns:
the number of documents of the combined index.

sizes

protected IntIterator sizes(int numIndex)
                     throws FileNotFoundException
Returns an iterator on sizes.

The purpose of this method is to provide combineSizes(OutputBitStream) implementations with a way to access the size list from a disk file or from Index.sizes transparently. This mechanism is essential to ensure that size files are read exactly once.

The caller should check whether the returned object implements Closeable, and, in this case, invoke Closeable.close() after usage.
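The check described above can be sketched as follows. This is only an illustration under stated assumptions: java.util.Iterator<Integer> stands in for the IntIterator returned by sizes(int), and FileBackedSizes and sumAndClose are hypothetical names that do not exist in MG4J.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the "close if Closeable" convention. */
class SizesCloseSketch {
    /** Stand-in for an iterator that reads sizes from a file and must be closed. */
    static final class FileBackedSizes implements Iterator<Integer>, Closeable {
        private final Iterator<Integer> delegate;
        boolean closed;

        FileBackedSizes(List<Integer> sizes) { delegate = sizes.iterator(); }
        @Override public boolean hasNext() { return delegate.hasNext(); }
        @Override public Integer next() { return delegate.next(); }
        @Override public void close() { closed = true; }
    }

    /** Consumes the iterator once, then closes it only if it implements Closeable. */
    static long sumAndClose(Iterator<Integer> sizes) {
        long sum = 0;
        try {
            while (sizes.hasNext()) sum += sizes.next();
        } finally {
            if (sizes instanceof Closeable) {
                try {
                    ((Closeable) sizes).close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return sum;
    }
}
```

Placing the check in a finally block means that a file-backed iterator is released even if reading fails halfway, while a purely in-memory iterator is left untouched.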

Parameters:
numIndex - the number of an index.
Returns:
an iterator on the sizes of the index.
Throws:
FileNotFoundException

combineSizes

protected abstract int combineSizes(OutputBitStream sizeOutputBitStream)
                             throws IOException
Combines size lists.

Returns:
the maximum size of a document in the combined index.
Throws:
IOException

combine

protected abstract int combine(int numUsedIndices)
                        throws IOException
Combines several indices.

When this method is called, exactly numUsedIndices entries of usedIndex contain, in increasing order, the indices containing inverted lists for the current term. Implementations of this method must combine the inverted lists, save the total global count for the current term and return the resulting frequency.
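The contract above can be illustrated with a toy model of term list fusion. The sketch below is not the MG4J implementation: plain arrays stand in for indices, a java.util.PriorityQueue stands in for termQueue, and the combine method simply sums the input frequencies, which is an assumption chosen for brevity (a real combiner decides how frequencies are combined). All names in the class are hypothetical except usedIndex, frequency and numUsedIndices, which mirror the fields and parameter documented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of term list fusion; plain arrays stand in for real indices. */
class TermFusionSketch {
    /** Stand-in for combine(int): here the frequencies are simply summed. */
    static int combine(int[] usedIndex, int numUsedIndices, int[] frequency) {
        int total = 0;
        for (int k = 0; k < numUsedIndices; k++) total += frequency[usedIndex[k]];
        return total;
    }

    /** Fuses sorted term lists; returns one "term=frequency" entry per distinct term. */
    static List<String> fuse(String[][] terms, int[][] freqs) {
        int n = terms.length;
        int[] cursor = new int[n];    // next unread term of each index
        int[] usedIndex = new int[n]; // partially filled for each term
        int[] frequency = new int[n]; // valid only for entries listed in usedIndex
        // Index numbers ordered by their current term; ties broken by index number,
        // so that usedIndex is filled in increasing order.
        PriorityQueue<Integer> queue = new PriorityQueue<>((a, b) -> {
            int c = terms[a][cursor[a]].compareTo(terms[b][cursor[b]]);
            return c != 0 ? c : Integer.compare(a, b);
        });
        for (int i = 0; i < n; i++) if (terms[i].length > 0) queue.add(i);

        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            String term = terms[queue.peek()][cursor[queue.peek()]];
            int numUsedIndices = 0;
            // Collect every index whose current term equals the smallest one.
            while (!queue.isEmpty() && terms[queue.peek()][cursor[queue.peek()]].equals(term)) {
                int i = queue.poll();
                usedIndex[numUsedIndices++] = i;
                frequency[i] = freqs[i][cursor[i]];
            }
            result.add(term + "=" + combine(usedIndex, numUsedIndices, frequency));
            // Advance the indices just used and put them back if they have more terms.
            for (int k = 0; k < numUsedIndices; k++) {
                int i = usedIndex[k];
                if (++cursor[i] < terms[i].length) queue.add(i);
            }
        }
        return result;
    }
}
```

Each input list is scanned once from start to end, which is the same access pattern that makes read-once combination from pipes possible.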

Parameters:
numUsedIndices - the number of valid entries in usedIndex.
Returns:
the frequency of the combined lists.
Throws:
IOException

run

public void run()
         throws ConfigurationException,
                IOException
Throws:
ConfigurationException
IOException

main

public static void main(String[] arg)
                 throws com.martiansoftware.jsap.JSAPException,
                        ConfigurationException,
                        IOException,
                        URISyntaxException,
                        ClassNotFoundException,
                        SecurityException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException
Throws:
com.martiansoftware.jsap.JSAPException
ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException

main

public static void main(String[] arg,
                        Class<? extends Combine> combineClass)
                 throws com.martiansoftware.jsap.JSAPException,
                        ConfigurationException,
                        IOException,
                        URISyntaxException,
                        ClassNotFoundException,
                        SecurityException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException
Throws:
com.martiansoftware.jsap.JSAPException
ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException