org.exist.storage
Class NativeTextEngine

java.lang.Object
  extended by java.util.Observable
      extended by org.exist.storage.TextSearchEngine
          extended by org.exist.storage.NativeTextEngine
All Implemented Interfaces:
ContentLoadingObserver

public class NativeTextEngine
extends TextSearchEngine
implements ContentLoadingObserver

This class is responsible for fulltext indexing. Text nodes are handed over to this class to be fulltext-indexed. Method storeText() is called by RelationalBroker whenever it finds a TextNode. Method getNodeIDsContaining() is used by the XPath engine to process queries that involve a fulltext operator. The class keeps two database tables: table dbTokens stores the words found, each with its unique id; table invertedIndex contains the word occurrences for every word id per document.

TODO: store node type (attribute or text) with each entry
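
The two-table layout described above can be pictured with a small in-memory sketch. It is not the eXist implementation: the class name, method names, the lower-casing step and the use of plain integers for document and node ids are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: an in-memory stand-in for the dbTokens and invertedIndex
// tables. None of these names belong to the eXist API.
class InvertedIndexSketch {
    // Plays the role of "dbTokens": every distinct word gets a unique id.
    private final Map<String, Integer> tokenIds = new HashMap<>();
    // Plays the role of "invertedIndex": word id -> document id -> node ids.
    private final Map<Integer, Map<Integer, List<Long>>> occurrences = new HashMap<>();

    // Splits the text on non-word characters and records one occurrence per word.
    void storeText(int docId, long nodeId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            Integer wordId = tokenIds.get(word);
            if (wordId == null) {
                wordId = tokenIds.size();
                tokenIds.put(word, wordId);
            }
            occurrences.computeIfAbsent(wordId, k -> new HashMap<>())
                       .computeIfAbsent(docId, k -> new ArrayList<>())
                       .add(nodeId);
        }
    }

    // Returns the node ids in the given document that contain the word.
    List<Long> nodesContaining(int docId, String word) {
        Integer wordId = tokenIds.get(word.toLowerCase(Locale.ROOT));
        if (wordId == null) {
            return Collections.emptyList();
        }
        return occurrences.getOrDefault(wordId, Collections.emptyMap())
                          .getOrDefault(docId, Collections.emptyList());
    }
}
```

Looking a word up first resolves it to its id and then reads the occurrence list for that id and document, which mirrors the split between the two tables.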

Author:
Wolfgang Meier

Field Summary
static int ATTRIBUTE_BY_QNAME
           
static int ATTRIBUTE_NOT_BY_QNAME
           
static byte ATTRIBUTE_SECTION
           
static double DEFAULT_WORD_CACHE_GROWTH
           
static double DEFAULT_WORD_KEY_THRESHOLD
           
static double DEFAULT_WORD_VALUE_THRESHOLD
           
static int DO_NOT_TOKENIZE
           
static String FILE_KEY_IN_CONFIG
           
static String FILE_NAME
           
static int FOURTH_OPTION
           
static int LENGTH_NODE_IDS_FREQ_OFFSETS
           
static int LENGTH_NODE_TYPE
           
static int MAX_TOKEN_LENGTH
          Length limit for the tokens
static int OFFSET_ATTRIBUTE_DLN_LENGTH
           
static int OFFSET_DLN
           
static int OFFSET_ELEMENT_CHILDREN_COUNT
           
static int OFFSET_NODE_TYPE
           
static int OFFSET_TEXT_DLN_LENGTH
           
static byte QNAME_SECTION
           
static int TEXT_BY_QNAME
           
static byte TEXT_SECTION
           
static int TOKENIZE
           
 
Fields inherited from class org.exist.storage.TextSearchEngine
CONFIGURATION_STOPWORDS_ELEMENT_NAME, INDEX_NUMBERS_ATTRIBUTE, PROPERTY_INDEX_NUMBERS, PROPERTY_STEM, PROPERTY_STOPWORD_FILE, PROPERTY_STORE_TERM_FREQUENCY, PROPERTY_TOKENIZER, STEM_ATTRIBUTE, STOPWORD_FILE_ATTRIBUTE, STORE_TERM_FREQUENCY_ATTRIBUTE, TOKENIZER_ATTRIBUTE
 
Constructor Summary
NativeTextEngine(DBBroker broker, BFile dbFile, Configuration config)
           
 
Method Summary
 boolean close()
           
 void closeAndRemove()
           
static boolean containsWildcards(String str)
          Checks if the given string could be a regular expression.
 void dropIndex(Collection collection)
          Remove index entries for an entire collection.
 void dropIndex(DocumentImpl document)
          Remove all index entries for the given document.
 void flush()
           
 String getConfigKeyForFile()
           
 String getFileName()
           
 String[] getIndexTerms(DocumentSet docs, TermMatcher matcher)
           
 NativeTextEngine getInstance()
           
 NodeSet getNodes(XQueryContext context, DocumentSet docs, NodeSet contextSet, int axis, QName qname, TermMatcher matcher, CharSequence startTerm)
           
 NodeSet getNodesContaining(XQueryContext context, DocumentSet docs, NodeSet contextSet, int axis, QName qname, String expr, int type, boolean matchAll)
          For each of the given search terms and each of the documents in the document set, return a node-set of matching nodes.
 NodeSet getNodesExact(XQueryContext context, DocumentSet docs, NodeSet contextSet, int axis, QName qname, String expr)
          Get all nodes whose content exactly matches the given expression.
 int getTrackMatches()
           
 void printStatistics()
           
 void remove()
          Remove all pending modifications for the current document.
 void removeNode(StoredNode node, NodePath currentPath, String content)
          The given node is being removed from the database.
 Occurrences[] scanIndexTerms(DocumentSet docs, NodeSet contextSet, QName[] qnames, String start, String end)
           
 Occurrences[] scanIndexTerms(DocumentSet docs, NodeSet contextSet, String start, String end)
          Queries the fulltext index to retrieve information on indexed words contained in the index for the current collection.
 void setDocument(DocumentImpl document)
          Set the current document; generally called before calling an operation.
 void setTrackMatches(int flags)
           
static boolean startsWithWildcard(String str)
           
 void storeAttribute(AttrImpl node, NodePath currentPath, int indexingHint, FulltextIndexSpec indexSpec, boolean remove)
          Indexes the tokens contained in an attribute.
 void storeAttribute(AttrImpl node, NodePath currentPath, int indexingHint, RangeIndexSpec idx, boolean remove)
          Store and index the given attribute.
 void storeText(StoredNode parent, ElementContent text, int indexingHint, FulltextIndexSpec indexSpec, boolean remove)
           
 void storeText(TextImpl node, int indexingHint, FulltextIndexSpec indexSpec, boolean remove)
          Indexes the tokens contained in a text node.
 void storeText(TextImpl node, NodePath currentPath, int indexingHint)
          Store and index the given text node.
 void sync()
          Triggers a cache sync, i.e. forces all cached pages to be written out.
 String toString()
           
 
Methods inherited from class org.exist.storage.TextSearchEngine
getNodesContaining, getTokenizer
 
Methods inherited from class java.util.Observable
addObserver, countObservers, deleteObserver, deleteObservers, hasChanged, notifyObservers, notifyObservers
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

FILE_NAME

public static final String FILE_NAME
See Also:
Constant Field Values

FILE_KEY_IN_CONFIG

public static final String FILE_KEY_IN_CONFIG
See Also:
Constant Field Values

DEFAULT_WORD_CACHE_GROWTH

public static final double DEFAULT_WORD_CACHE_GROWTH
See Also:
Constant Field Values

DEFAULT_WORD_KEY_THRESHOLD

public static final double DEFAULT_WORD_KEY_THRESHOLD
See Also:
Constant Field Values

DEFAULT_WORD_VALUE_THRESHOLD

public static final double DEFAULT_WORD_VALUE_THRESHOLD
See Also:
Constant Field Values

TEXT_SECTION

public static final byte TEXT_SECTION
See Also:
Constant Field Values

ATTRIBUTE_SECTION

public static final byte ATTRIBUTE_SECTION
See Also:
Constant Field Values

QNAME_SECTION

public static final byte QNAME_SECTION
See Also:
Constant Field Values

ATTRIBUTE_BY_QNAME

public static int ATTRIBUTE_BY_QNAME

ATTRIBUTE_NOT_BY_QNAME

public static int ATTRIBUTE_NOT_BY_QNAME

TOKENIZE

public static int TOKENIZE

DO_NOT_TOKENIZE

public static int DO_NOT_TOKENIZE

TEXT_BY_QNAME

public static int TEXT_BY_QNAME

FOURTH_OPTION

public static int FOURTH_OPTION

LENGTH_NODE_TYPE

public static final int LENGTH_NODE_TYPE
See Also:
Constant Field Values

LENGTH_NODE_IDS_FREQ_OFFSETS

public static final int LENGTH_NODE_IDS_FREQ_OFFSETS
See Also:
Constant Field Values

OFFSET_NODE_TYPE

public static final int OFFSET_NODE_TYPE
See Also:
Constant Field Values

OFFSET_ELEMENT_CHILDREN_COUNT

public static final int OFFSET_ELEMENT_CHILDREN_COUNT
See Also:
Constant Field Values

OFFSET_ATTRIBUTE_DLN_LENGTH

public static final int OFFSET_ATTRIBUTE_DLN_LENGTH
See Also:
Constant Field Values

OFFSET_TEXT_DLN_LENGTH

public static final int OFFSET_TEXT_DLN_LENGTH
See Also:
Constant Field Values

OFFSET_DLN

public static final int OFFSET_DLN
See Also:
Constant Field Values

MAX_TOKEN_LENGTH

public static final int MAX_TOKEN_LENGTH
Length limit for the tokens

See Also:
Constant Field Values
Constructor Detail

NativeTextEngine

public NativeTextEngine(DBBroker broker,
                        BFile dbFile,
                        Configuration config)
                 throws DBException
Throws:
DBException
Method Detail

getFileName

public String getFileName()

getConfigKeyForFile

public String getConfigKeyForFile()

getInstance

public NativeTextEngine getInstance()

containsWildcards

public static final boolean containsWildcards(String str)
Checks if the given string could be a regular expression.

Parameters:
str - The string
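
As a rough illustration of what such a check might look like, the sketch below scans a string for pattern metacharacters. The helper name and the set of characters treated as wildcards are assumptions; this page does not state which characters the real method looks for.

```java
// Hypothetical helper, not the eXist implementation: the characters treated
// as pattern metacharacters here are an assumption.
class WildcardCheckSketch {
    private static final String META_CHARS = "*?[]\\";

    // Returns true if the string contains at least one assumed metacharacter.
    static boolean looksLikePattern(String str) {
        if (str == null) {
            return false;
        }
        for (int i = 0; i < str.length(); i++) {
            if (META_CHARS.indexOf(str.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```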

startsWithWildcard

public static final boolean startsWithWildcard(String str)

getTrackMatches

public int getTrackMatches()
Overrides:
getTrackMatches in class TextSearchEngine

setTrackMatches

public void setTrackMatches(int flags)
Overrides:
setTrackMatches in class TextSearchEngine

setDocument

public void setDocument(DocumentImpl document)
Description copied from interface: ContentLoadingObserver
Set the current document; generally called before calling an operation.

Specified by:
setDocument in interface ContentLoadingObserver

storeAttribute

public void storeAttribute(AttrImpl node,
                           NodePath currentPath,
                           int indexingHint,
                           FulltextIndexSpec indexSpec,
                           boolean remove)
Indexes the tokens contained in an attribute.

Parameters:
node - The attribute to be indexed

storeAttribute

public void storeAttribute(AttrImpl node,
                           NodePath currentPath,
                           int indexingHint,
                           RangeIndexSpec idx,
                           boolean remove)
Description copied from interface: ContentLoadingObserver
Store and index the given attribute.

Specified by:
storeAttribute in interface ContentLoadingObserver

storeText

public void storeText(TextImpl node,
                      int indexingHint,
                      FulltextIndexSpec indexSpec,
                      boolean remove)
Indexes the tokens contained in a text node.

Specified by:
storeText in class TextSearchEngine
Parameters:
indexSpec - The index configuration
node - The text node to be indexed
indexingHint - if DO_NOT_TOKENIZE, the given text is indexed as a single token; otherwise it is tokenized before being indexed
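
The two behaviours named in the indexingHint description (index the whole text as one token, or tokenize it first) can be sketched as follows. The constants are local stand-ins with arbitrary values, and splitting on whitespace is an assumed simplification; the real tokenizer is configurable and is not shown on this page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only. The constants are local stand-ins; their values are
// arbitrary and are not taken from NativeTextEngine.
class IndexingHintSketch {
    static final int TOKENIZE = 0;
    static final int DO_NOT_TOKENIZE = 1;

    // Returns the tokens that would be indexed for the given text and hint.
    static List<String> tokensFor(String text, int indexingHint) {
        if (indexingHint == DO_NOT_TOKENIZE) {
            // The whole text becomes a single index entry.
            return Collections.singletonList(text);
        }
        // Assumed simplification: split on runs of whitespace.
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Indexing a value as a single token suits identifiers or codes that should only match as a whole, while tokenizing suits running text.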

storeText

public void storeText(StoredNode parent,
                      ElementContent text,
                      int indexingHint,
                      FulltextIndexSpec indexSpec,
                      boolean remove)
Specified by:
storeText in class TextSearchEngine

storeText

public void storeText(TextImpl node,
                      NodePath currentPath,
                      int indexingHint)
Description copied from interface: ContentLoadingObserver
Store and index the given text node.

Specified by:
storeText in interface ContentLoadingObserver

removeNode

public void removeNode(StoredNode node,
                       NodePath currentPath,
                       String content)
Description copied from interface: ContentLoadingObserver
The given node is being removed from the database.

Specified by:
removeNode in interface ContentLoadingObserver

sync

public void sync()
Description copied from interface: ContentLoadingObserver
Triggers a cache sync, i.e. forces all cached pages to be written out. sync() is called from time to time by the background sync daemon.

Specified by:
sync in interface ContentLoadingObserver

flush

public void flush()
Specified by:
flush in interface ContentLoadingObserver
Specified by:
flush in class TextSearchEngine

remove

public void remove()
Description copied from interface: ContentLoadingObserver
Remove all pending modifications for the current document.

Specified by:
remove in interface ContentLoadingObserver

dropIndex

public void dropIndex(Collection collection)
Description copied from class: TextSearchEngine
Remove index entries for an entire collection.

Specified by:
dropIndex in interface ContentLoadingObserver
Specified by:
dropIndex in class TextSearchEngine

dropIndex

public void dropIndex(DocumentImpl document)
Description copied from class: TextSearchEngine
Remove all index entries for the given document.

Specified by:
dropIndex in interface ContentLoadingObserver
Specified by:
dropIndex in class TextSearchEngine

getNodesContaining

public NodeSet getNodesContaining(XQueryContext context,
                                  DocumentSet docs,
                                  NodeSet contextSet,
                                  int axis,
                                  QName qname,
                                  String expr,
                                  int type,
                                  boolean matchAll)
                           throws TerminatedException
Description copied from class: TextSearchEngine
For each of the given search terms and each of the documents in the document set, return a node-set of matching nodes. The type-argument indicates if search terms should be compared using a regular expression. Valid values are DBBroker.MATCH_EXACT or DBBroker.MATCH_REGEXP.

Specified by:
getNodesContaining in class TextSearchEngine
Throws:
TerminatedException
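
The difference between the two match types can be sketched on a single indexed term. The constants below are local stand-ins for DBBroker.MATCH_EXACT and DBBroker.MATCH_REGEXP with arbitrary values, and the use of java.util.regex is an assumption; the page does not say which regular expression engine or syntax the real method uses.

```java
import java.util.regex.Pattern;

// Illustrative only: the constants are local stand-ins, not the DBBroker fields.
class TermMatchSketch {
    static final int MATCH_EXACT = 0;
    static final int MATCH_REGEXP = 1;

    // Compares one indexed term against the search expression.
    static boolean matches(String indexedTerm, String expr, int type) {
        switch (type) {
            case MATCH_EXACT:
                return indexedTerm.equals(expr);
            case MATCH_REGEXP:
                // The whole term must match the expression.
                return Pattern.matches(expr, indexedTerm);
            default:
                throw new IllegalArgumentException("unknown match type: " + type);
        }
    }
}
```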

getNodesExact

public NodeSet getNodesExact(XQueryContext context,
                             DocumentSet docs,
                             NodeSet contextSet,
                             int axis,
                             QName qname,
                             String expr)
                      throws TerminatedException
Get all nodes whose content exactly matches the given expression.

Throws:
TerminatedException

getNodes

public NodeSet getNodes(XQueryContext context,
                        DocumentSet docs,
                        NodeSet contextSet,
                        int axis,
                        QName qname,
                        TermMatcher matcher,
                        CharSequence startTerm)
                 throws TerminatedException
Specified by:
getNodes in class TextSearchEngine
Throws:
TerminatedException

getIndexTerms

public String[] getIndexTerms(DocumentSet docs,
                              TermMatcher matcher)
Specified by:
getIndexTerms in class TextSearchEngine

scanIndexTerms

public Occurrences[] scanIndexTerms(DocumentSet docs,
                                    NodeSet contextSet,
                                    String start,
                                    String end)
                             throws PermissionDeniedException
Description copied from class: TextSearchEngine
Queries the fulltext index to retrieve information on indexed words contained in the index for the current collection. Returns a list of Occurrences for all words contained in the index. If param end is null, all words starting with the string sequence param start are returned. Otherwise, the method returns all words that come after start and before end in lexical order.

Specified by:
scanIndexTerms in class TextSearchEngine
Throws:
PermissionDeniedException
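
The start/end semantics described for this method can be illustrated with a sorted set of terms. This is a sketch, not the eXist code: it returns bare terms rather than Occurrences, the class and method names are invented, and treating start as inclusive and end as exclusive is an assumption, since the description does not say how the bounds are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: a sorted in-memory term list standing in for the index.
class TermScanSketch {
    private final TreeSet<String> terms = new TreeSet<>();

    void add(String term) {
        terms.add(term);
    }

    // end == null: all terms starting with the prefix 'start'.
    // otherwise:   terms in [start, end) in lexical order (bounds assumed).
    List<String> scan(String start, String end) {
        List<String> result = new ArrayList<>();
        if (end == null) {
            // Terms sharing a prefix are contiguous in sorted order, so the
            // scan can stop at the first term without the prefix.
            for (String term : terms.tailSet(start, true)) {
                if (!term.startsWith(start)) {
                    break;
                }
                result.add(term);
            }
        } else {
            result.addAll(terms.subSet(start, true, end, false));
        }
        return result;
    }
}
```

Note that TreeSet.subSet throws IllegalArgumentException when start sorts after end, so a caller of this sketch would need to pass the bounds in order.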

scanIndexTerms

public Occurrences[] scanIndexTerms(DocumentSet docs,
                                    NodeSet contextSet,
                                    QName[] qnames,
                                    String start,
                                    String end)
                             throws PermissionDeniedException
Specified by:
scanIndexTerms in class TextSearchEngine
Throws:
PermissionDeniedException

closeAndRemove

public void closeAndRemove()
Specified by:
closeAndRemove in interface ContentLoadingObserver

close

public boolean close()
              throws DBException
Specified by:
close in interface ContentLoadingObserver
Specified by:
close in class TextSearchEngine
Throws:
DBException

printStatistics

public void printStatistics()
Specified by:
printStatistics in interface ContentLoadingObserver

toString

public String toString()
Overrides:
toString in class Object


Copyright (C) Wolfgang Meier. All rights reserved.