java.lang.Object
  it.unimi.dsi.mg4j.index.DiskBasedIndex
public class DiskBasedIndex
A static container providing facilities to load an index based on data stored on disk.
This class contains several useful static methods, such as readOffsets(InputBitStream, int) and readSizes(InputBitStream, int), and static factory methods, such as getInstance(CharSequence, boolean, boolean, boolean, EnumMap), that take care of reading the properties associated with the index, identifying the correct Index implementation that should be used to load the index, and loading the necessary data into memory.
As an option, a disk-based index can be loaded into main memory (key: Index.UriKeys.INMEMORY), returning an InMemoryIndex/InMemoryHPIndex, or mapped into main memory (key: Index.UriKeys.MAPPED), returning a MemoryMappedIndex/InMemoryHPIndex (note that the value assigned to the keys is irrelevant). In both cases some insurmountable Java problems prevent using indices whose size exceeds two gigabytes (but see MemoryMappedIndex for some elaboration on this topic).
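The remark that key values are irrelevant can be illustrated with a small sketch. The enum below is a hypothetical stand-in for Index.UriKeys limited to the keys named on this page, the Mode type is invented for illustration, and the precedence of INMEMORY over MAPPED when both keys are present is an assumption, not documented behaviour.

```java
import java.util.EnumMap;

final class LoadingModeSketch {
    /** Hypothetical stand-in for Index.UriKeys, limited to the keys named above. */
    enum UriKeys { INMEMORY, MAPPED, OFFSETSTEP }

    /** Illustrative result type; not part of MG4J. */
    enum Mode { ON_DISK, IN_MEMORY, MAPPED }

    /** Picks a loading mode from the keys that are present; their values are ignored. */
    static Mode loadingMode(EnumMap<UriKeys, String> queryProperties) {
        if (queryProperties == null) return Mode.ON_DISK;
        // Assumption: INMEMORY takes precedence if both keys are present.
        if (queryProperties.containsKey(UriKeys.INMEMORY)) return Mode.IN_MEMORY;
        if (queryProperties.containsKey(UriKeys.MAPPED)) return Mode.MAPPED;
        return Mode.ON_DISK;
    }
}
```

Only the presence of a key is tested, so any value (even an empty string) selects the corresponding mode.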
Moreover, by default the term-offset list is accessed using a SemiExternalOffsetList with a step of DEFAULT_OFFSET_STEP. This behaviour can be changed using the URI key Index.UriKeys.OFFSETSTEP.
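To give an intuition of what a step means for an offset list, here is a minimal sketch of the general sampling idea: one offset every step entries is kept in memory, and the others are rebuilt by summing gaps from the nearest sample. This is not the SemiExternalOffsetList implementation; the class name, the in-memory gap array standing in for the compressed stream, and the gap encoding itself are all assumptions made for illustration.

```java
final class SampledOffsetsSketch {
    private final long[] gaps;    // gaps[i] = offset[i] - offset[i - 1]; stands in for the on-disk stream
    private final long[] samples; // samples[k] = offset[k * step], kept in memory
    private final int step;

    SampledOffsetsSketch(long[] gaps, int step) {
        if (step <= 0) throw new IllegalArgumentException("step must be positive");
        this.gaps = gaps.clone();
        this.step = step;
        this.samples = new long[(gaps.length + step - 1) / step];
        long offset = 0;
        for (int i = 0; i < gaps.length; i++) {
            offset += gaps[i];
            if (i % step == 0) samples[i / step] = offset; // remember one offset per block
        }
    }

    /** Returns the offset of the given index, decoding at most step - 1 gaps. */
    long getLong(int index) {
        if (index < 0 || index >= gaps.length) throw new IndexOutOfBoundsException("index: " + index);
        long offset = samples[index / step];
        // Walk forward from the sampled position to the requested one.
        for (int i = index - index % step + 1; i <= index; i++) offset += gaps[i];
        return offset;
    }
}
```

A larger step reduces the number of offsets held in memory at the cost of more decoding work per access; a step of one keeps every offset in memory.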
Disk-based indices are the workhorse of MG4J. All other indices (clustered, remote, etc.) ultimately rely on disk-based indices to provide results.
Note that not all data produced by Scan and by the other indexing utilities are actually necessary to run a disk-based index. Usually the property file and the index file (plus the positions file, for high-performance indices) are sufficient. If one needs random access, the offsets file must also be present, and if the compression method requires document sizes or if sizes are requested explicitly, the sizes file must also be present. A StringMap and possibly a PrefixMap will be fetched automatically by getInstance(CharSequence, boolean, boolean) using standard extensions.
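The file requirements above can be summarised as a small decision procedure. The sketch below is illustrative only: the extension strings are assumed values (the authoritative ones are the *_EXTENSION constants of this class), and the method name and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RequiredFilesSketch {
    // Assumed values; the real ones are defined by the *_EXTENSION constants of DiskBasedIndex.
    static final String PROPERTIES = ".properties";
    static final String INDEX = ".index";
    static final String POSITIONS = ".positions";
    static final String OFFSETS = ".offsets";
    static final String SIZES = ".sizes";

    /** Lists the files that must exist for the given basename and options. */
    static List<String> requiredFiles(String basename, boolean highPerformance,
            boolean randomAccess, boolean needsSizes) {
        List<String> files = new ArrayList<>();
        files.add(basename + PROPERTIES);                     // always needed
        files.add(basename + INDEX);                          // always needed
        if (highPerformance) files.add(basename + POSITIONS); // high-performance indices only
        if (randomAccess) files.add(basename + OFFSETS);      // random access needs offsets
        if (needsSizes) files.add(basename + SIZES);          // required by compression or requested
        return files;
    }
}
```

Here needsSizes stands for both cases mentioned in the text: sizes required by the compression method and sizes requested explicitly.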
A disk-based index is thread safe as long as the offset list, the size list and the term/prefix map are. The static factory methods provided by this class load offsets and sizes using data structures that are thread safe. If you use a constructor directly, instead, it is your responsibility to pass thread-safe data structures.
Field Summary | |
---|---|
static int | DEFAULT_OFFSET_STEP - The default value for the query parameter Index.UriKeys.OFFSETSTEP. |
static String | FREQUENCIES_EXTENSION - Standard extension for the file of frequencies. |
static String | GLOBCOUNTS_EXTENSION - Standard extension for the file of global counts. |
static String | INDEX_EXTENSION - Standard extension for the index bitstream. |
static String | OFFSETS_EXTENSION - Standard extension for the file of offsets. |
static String | POSITIONS_EXTENSION - Standard extension for the positions bitstream of a high-performance index. |
static String | PREFIXMAP_EXTENSION - Standard extension for the prefix map. |
static String | PROPERTIES_EXTENSION - Standard extension for the index properties. |
static String | SIZES_EXTENSION - Standard extension for the file of sizes. |
static String | STATS_EXTENSION - Standard extension for the stats file. |
static String | TERMMAP_EXTENSION - Standard extension for the term map. |
static String | TERMS_EXTENSION - Standard extension for the file of terms. |
static String | UNSORTED_TERMS_EXTENSION - Standard extension for the file of terms, unsorted. |
Method Summary | |
---|---|
static BitStreamIndex | getInstance(CharSequence basename) - Returns a new local index, trying to guess reasonable term and prefix maps from the basename, loading offsets but loading document sizes only if it is necessary. |
static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess) - Returns a new local index, trying to guess reasonable term and prefix maps from the basename, and loading document sizes only if it is necessary. |
static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes) - Returns a new disk-based index, guessing reasonable term and prefix maps from the basename. |
static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps) - Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename. |
static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties) - Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename. |
static BitStreamIndex | getInstance(CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties) - Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename. |
static BitStreamIndex | getInstance(CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties) - Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties. |
static PrefixMap<? extends CharSequence> | loadPrefixMap(String filename) - Utility static method that loads a prefix map. |
static StringMap<? extends CharSequence> | loadStringMap(String filename) - Utility static method that loads a term map. |
static LongList | readOffsets(InputBitStream in, int T) - Utility method to load a compressed offset file into a list. |
static IntList | readSizes(InputBitStream in, int N) - Utility method to load a compressed size file into a list. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
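public static final int DEFAULT_OFFSET_STEP
The default value for the query parameter Index.UriKeys.OFFSETSTEP.
public static final String INDEX_EXTENSION
Standard extension for the index bitstream.
public static final String POSITIONS_EXTENSION
Standard extension for the positions bitstream of a high-performance index.
public static final String PROPERTIES_EXTENSION
Standard extension for the index properties.
public static final String SIZES_EXTENSION
Standard extension for the file of sizes.
public static final String OFFSETS_EXTENSION
Standard extension for the file of offsets.
public static final String GLOBCOUNTS_EXTENSION
Standard extension for the file of global counts.
public static final String FREQUENCIES_EXTENSION
Standard extension for the file of frequencies.
public static final String TERMS_EXTENSION
Standard extension for the file of terms.
public static final String UNSORTED_TERMS_EXTENSION
Standard extension for the file of terms, unsorted.
public static final String TERMMAP_EXTENSION
Standard extension for the term map.
public static final String PREFIXMAP_EXTENSION
Standard extension for the prefix map.
public static final String STATS_EXTENSION
Standard extension for the stats file.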
Method Detail |
---|
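public static LongList readOffsets(InputBitStream in, int T) throws IOException
Utility method to load a compressed offset file into a list.
Parameters:
in - the input bit stream providing the offsets (see BitStreamIndexWriter).
T - the number of terms indexed.
Returns:
the list of offsets, with an additional final element of index T that gives the number of bytes of the index file.
Throws:
IOException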
public static IntList readSizes(InputBitStream in, int N) throws IOException
Utility method to load a compressed size file into a list.
Parameters:
in - the input bit stream providing the sizes (see BitStreamIndexWriter).
N - the number of documents indexed.
Throws:
IOException
public static StringMap<? extends CharSequence> loadStringMap(String filename) throws IOException
Utility static method that loads a term map.
Parameters:
filename - the name of the file containing the term map.
Returns:
the term map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
public static PrefixMap<? extends CharSequence> loadPrefixMap(String filename) throws IOException
Utility static method that loads a prefix map.
Parameters:
filename - the name of the file containing the prefix map.
Returns:
the prefix map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
public static BitStreamIndex getInstance(CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties.
Parameters:
basename - the basename of the index.
properties - the properties obtained from the given basename.
termMap - the term map for this index, or null for no term map.
prefixMap - the prefix map for this index, or null for no prefix map.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
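The contract documented above for loadStringMap and loadPrefixMap (null for a missing file, an exception for any other I/O problem) follows a common pattern, sketched here with plain Java serialization. The class and method names are hypothetical, and both the use of ObjectInputStream and the wrapping of ClassNotFoundException into an IOException are assumptions of this sketch, not a description of how MG4J loads maps.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.ObjectInputStream;

final class OptionalLoadSketch {
    /** Loads a serialized object, or returns null if the file does not exist. */
    static Object loadOrNull(String filename) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(filename))) {
            return in.readObject();
        } catch (FileNotFoundException e) {
            return null; // a missing file is not an error for an optional map
        } catch (ClassNotFoundException e) {
            // Assumption of this sketch: report an unknown class as an I/O problem.
            throw new IOException("Cannot deserialize " + filename, e);
        }
        // Any other IOException (e.g., a truncated file) propagates to the caller.
    }
}
```

Callers can then treat null as "no map available" while still failing fast on a corrupted file.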
public static BitStreamIndex getInstance(CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename.
Parameters:
basename - the basename of the index.
properties - the properties obtained by stemming basename.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded.
maps - if true, term and prefix maps will be guessed and loaded.
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
IllegalAccessException
InstantiationException
ClassNotFoundException
IOException
See Also:
getInstance(CharSequence, Properties, StringMap, PrefixMap, boolean, boolean, EnumMap)
public static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and, in case it implements PrefixMap, also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap); it is used as prefix map and, if it implements StringMap and no term map has been found, also as term map.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
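The map-guessing rules described for this method can be sketched as a small selection function. This is an interpretation rather than the actual implementation: the nested interfaces are empty stand-ins for StringMap and PrefixMap, the toy classes exist only for demonstration, and the sketch assumes the two rules are symmetric (a term map that is also a PrefixMap serves both roles; with no term map, a prefix map that is also a StringMap serves both roles).

```java
final class MapGuessingSketch {
    /** Hypothetical stand-ins for the StringMap and PrefixMap interfaces. */
    interface StringMap {}
    interface PrefixMap {}

    /** Toy implementations used only to exercise the sketch. */
    static final class TermOnly implements StringMap {}
    static final class PrefixOnly implements PrefixMap {}
    static final class Both implements StringMap, PrefixMap {}

    /** The pair of maps selected for an index; either field may be null. */
    static final class Maps {
        final StringMap termMap;
        final PrefixMap prefixMap;
        Maps(StringMap termMap, PrefixMap prefixMap) {
            this.termMap = termMap;
            this.prefixMap = prefixMap;
        }
    }

    /** Chooses the maps given what was found under .termmap and .prefixmap (null if absent). */
    static Maps guess(StringMap loadedTermMap, PrefixMap loadedPrefixMap) {
        // A term map that also implements PrefixMap serves both roles.
        if (loadedTermMap instanceof PrefixMap) return new Maps(loadedTermMap, (PrefixMap) loadedTermMap);
        StringMap termMap = loadedTermMap;
        // With no term map, a prefix map that implements StringMap doubles as term map.
        if (termMap == null && loadedPrefixMap instanceof StringMap) termMap = (StringMap) loadedPrefixMap;
        return new Maps(termMap, loadedPrefixMap);
    }
}
```

In this sketch the prefix map file is not consulted at all once a term map implementing PrefixMap has been found, matching the "Otherwise" in the description.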
public static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and, in case it implements PrefixMap, also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap); it is used as prefix map and, if it implements StringMap and no term map has been found, also as term map.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
See Also:
getInstance(CharSequence, boolean, boolean, boolean, EnumMap)
public static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, guessing reasonable term and prefix maps from the basename.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
public static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new local index, trying to guess reasonable term and prefix maps from the basename, and loading document sizes only if it is necessary.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
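public static BitStreamIndex getInstance(CharSequence basename) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new local index, trying to guess reasonable term and prefix maps from the basename, loading offsets but loading document sizes only if it is necessary.
Parameters:
basename - the basename of the index.
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException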
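The shorter getInstance overloads differ only in which options they fix. The sketch below shows the defaults as they can be inferred from the method summaries (offsets loaded, sizes loaded only if necessary, maps guessed). The Request class is hypothetical and merely records the resolved options; whether the real overloads delegate to one another in this way is an assumption.

```java
final class OverloadDefaultsSketch {
    /** Hypothetical record of the options an overload resolves to. */
    static final class Request {
        final String basename;
        final boolean randomAccess, documentSizes, maps;
        Request(String basename, boolean randomAccess, boolean documentSizes, boolean maps) {
            this.basename = basename;
            this.randomAccess = randomAccess;
            this.documentSizes = documentSizes;
            this.maps = maps;
        }
    }

    /** Offsets are loaded by default, so random access is available. */
    static Request getInstance(String basename) {
        return getInstance(basename, true);
    }

    /** Document sizes are not requested explicitly by default. */
    static Request getInstance(String basename, boolean randomAccess) {
        return getInstance(basename, randomAccess, false);
    }

    /** Term and prefix maps are guessed by default. */
    static Request getInstance(String basename, boolean randomAccess, boolean documentSizes) {
        return getInstance(basename, randomAccess, documentSizes, true);
    }

    /** The fully specified form; a real implementation would load the index here. */
    static Request getInstance(String basename, boolean randomAccess, boolean documentSizes, boolean maps) {
        return new Request(basename, randomAccess, documentSizes, maps);
    }
}
```

Note that passing false for documentSizes does not guarantee that sizes stay unloaded, since the compression method for positions may require them.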