org.apache.solr.analysis
Class CommonGramsFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.solr.analysis.BufferedTokenStream
                  extended by org.apache.solr.analysis.CommonGramsFilter

public class CommonGramsFilter
extends BufferedTokenStream

Construct bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. This is achieved through the use of Token.setPositionIncrement(int). Bigrams have a type of "gram".
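To illustrate how a position increment of 0 overlays a bigram on a unigram, the following sketch replays a hand-written token list for "the quick brown fox" (with "the" as the only common word) and derives the absolute position of each token. The Tok class, the positions and sample helpers, the "_" separator in the bigram term, and the "word" type for unigrams are assumptions made for illustration; they are not part of this class's API.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionOverlaySketch {

    /** Hypothetical stand-in for a token: term text, position increment, type. */
    public static final class Tok {
        final String term;
        final int posInc;
        final String type;

        public Tok(String term, int posInc, String type) {
            this.term = term;
            this.posInc = posInc;
            this.type = type;
        }
    }

    /** Renders each token as "term@position(type)" by summing position increments. */
    public static List<String> positions(List<Tok> stream) {
        List<String> out = new ArrayList<String>();
        int position = -1; // the first increment of 1 lands on position 0
        for (Tok t : stream) {
            position += t.posInc;
            out.add(t.term + "@" + position + "(" + t.type + ")");
        }
        return out;
    }

    /** Assumed output stream for "the quick brown fox" when only "the" is common. */
    public static List<Tok> sample() {
        List<Tok> stream = new ArrayList<Tok>();
        stream.add(new Tok("the", 1, "word"));
        stream.add(new Tok("the_quick", 0, "gram")); // increment 0: shares the position of "the"
        stream.add(new Tok("quick", 1, "word"));
        stream.add(new Tok("brown", 1, "word"));
        stream.add(new Tok("fox", 1, "word"));
        return stream;
    }

    public static void main(String[] args) {
        for (String s : positions(sample())) {
            System.out.println(s);
        }
    }
}
```

In this sketch the bigram and the unigram "the" resolve to the same position, which is what "overlaid" means above: the bigram adds a term at an existing position rather than shifting the terms that follow.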


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
CommonGramsFilter(TokenStream input, Set commonWords)
          Construct a token stream filtering the given input using a Set of common words to create bigrams.
CommonGramsFilter(TokenStream input, Set commonWords, boolean ignoreCase)
          Construct a token stream filtering the given input using a Set of common words to create bigrams, case-sensitive if ignoreCase is false (unless Set is CharArraySet).
CommonGramsFilter(TokenStream input, String[] commonWords)
          Construct a token stream filtering the given input using an Array of common words to create bigrams.
CommonGramsFilter(TokenStream input, String[] commonWords, boolean ignoreCase)
          Construct a token stream filtering the given input using an Array of common words to create bigrams; matching is case-sensitive if ignoreCase is false.
 
Method Summary
 void init()
           
static CharArraySet makeCommonSet(String[] commonWords)
          Build a CharArraySet from an array of common words, appropriate for passing into the CommonGramsFilter constructor.
static CharArraySet makeCommonSet(String[] commonWords, boolean ignoreCase)
          Build a CharArraySet from an array of common words, appropriate for passing into the CommonGramsFilter constructor, case-sensitive if ignoreCase is false.
 Token process(Token token)
          Inserts bigrams for common words into a token stream.
 void reset()
           
 
Methods inherited from class org.apache.solr.analysis.BufferedTokenStream
next, output, peek, pushBack, read, write
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
getOnlyUseNewAPI, incrementToken, next, setOnlyUseNewAPI
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

CommonGramsFilter

public CommonGramsFilter(TokenStream input,
                         Set commonWords)
Construct a token stream filtering the given input using a Set of common words to create bigrams. Outputs the unigrams with their position increments, plus bigrams with position increment 0 and type "gram" wherever one or both of the words in a potential bigram are in the set of common words.

Parameters:
input - TokenStream input in filter chain
commonWords - The set of common words.

CommonGramsFilter

public CommonGramsFilter(TokenStream input,
                         Set commonWords,
                         boolean ignoreCase)
Construct a token stream filtering the given input using a Set of common words to create bigrams, case-sensitive if ignoreCase is false (unless Set is CharArraySet). If commonWords is an instance of CharArraySet (true if makeCommonSet() was used to construct the set) it will be directly used and ignoreCase will be ignored since CharArraySet directly controls case sensitivity.

If commonWords is not an instance of CharArraySet, a new CharArraySet will be constructed and ignoreCase will be used to specify the case sensitivity of that set.
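The two cases above can be modeled in plain Java. In this sketch, CaseAwareSet is a hypothetical stand-in for CharArraySet (a set whose case sensitivity is fixed when it is built), and resolve is a hypothetical helper that mirrors what the constructor is described as doing. Neither is the actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class CommonSetResolutionSketch {

    /** Hypothetical stand-in for CharArraySet: case sensitivity is fixed at construction. */
    public static final class CaseAwareSet extends HashSet<String> {
        private final boolean ignoreCase;

        public CaseAwareSet(Iterable<String> words, boolean ignoreCase) {
            this.ignoreCase = ignoreCase;
            for (String w : words) {
                add(w);
            }
        }

        @Override
        public boolean add(String w) {
            return super.add(normalize(w));
        }

        @Override
        public boolean contains(Object o) {
            return o instanceof String && super.contains(normalize((String) o));
        }

        private String normalize(String w) {
            return ignoreCase ? w.toLowerCase(Locale.ROOT) : w;
        }
    }

    /** Mirrors the documented behavior: reuse a case-aware set, otherwise build one. */
    public static CaseAwareSet resolve(Set<String> commonWords, boolean ignoreCase) {
        if (commonWords instanceof CaseAwareSet) {
            // Already case-aware: used directly, so the ignoreCase argument has no effect.
            return (CaseAwareSet) commonWords;
        }
        // Ordinary Set: copied into a new case-aware set governed by ignoreCase.
        return new CaseAwareSet(commonWords, ignoreCase);
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<String>(Arrays.asList("The", "of"));
        System.out.println(resolve(plain, true).contains("THE"));

        CaseAwareSet strict = new CaseAwareSet(plain, false);
        System.out.println(resolve(strict, true).contains("the"));
    }
}
```

The practical consequence is that a caller who builds the set with makeCommonSet should choose case sensitivity there, since passing a different ignoreCase value to the constructor afterwards changes nothing.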

Parameters:
input - TokenStream input in filter chain.
commonWords - The set of common words.
ignoreCase - Ignore case when constructing bigrams for common words.

CommonGramsFilter

public CommonGramsFilter(TokenStream input,
                         String[] commonWords)
Construct a token stream filtering the given input using an Array of common words to create bigrams.

Parameters:
input - TokenStream in filter chain
commonWords - words to be used in constructing bigrams

CommonGramsFilter

public CommonGramsFilter(TokenStream input,
                         String[] commonWords,
                         boolean ignoreCase)
Construct a token stream filtering the given input using an Array of common words to create bigrams; matching is case-sensitive if ignoreCase is false.

Parameters:
input - TokenStream in filter chain
commonWords - words to be used in constructing bigrams
ignoreCase - Ignore case when constructing bigrams for common words.

Method Detail

init

public void init()

makeCommonSet

public static final CharArraySet makeCommonSet(String[] commonWords)
Build a CharArraySet from an array of common words, appropriate for passing into the CommonGramsFilter constructor. This permits this commonWords construction to be cached once when an Analyzer is constructed.

See Also:
makeCommonSet(String[], boolean), passing false to ignoreCase

makeCommonSet

public static final CharArraySet makeCommonSet(String[] commonWords,
                                               boolean ignoreCase)
Build a CharArraySet from an array of common words, appropriate for passing into the CommonGramsFilter constructor, case-sensitive if ignoreCase is false.

Parameters:
commonWords - the array of common words
ignoreCase - If true, all words are lower cased first.
Returns:
a Set containing the words

process

public Token process(Token token)
              throws IOException
Inserts bigrams for common words into a token stream. For each input token, output the token. If the token and/or the following token are in the list of common words, also output a bigram with position increment 0 and type "gram".

Specified by:
process in class BufferedTokenStream
Throws:
IOException
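
The rule described for process can be sketched without any Lucene types. The overlay helper below is a hypothetical illustration, not the filter's implementation: it works on a list of plain strings, looks one word ahead, and renders each output token as "term/increment/type". The "_" separator used to join a bigram, the "word" type for unigrams, and the fixed increment of 1 for every unigram are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsProcessSketch {

    /** Emits every unigram, plus a bigram whenever either neighbor is a common word. */
    public static List<String> overlay(List<String> words, Set<String> common) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < words.size(); i++) {
            String current = words.get(i);
            out.add(current + "/1/word"); // the unigram is always emitted
            if (i + 1 < words.size()) {
                String next = words.get(i + 1); // look ahead one token
                if (common.contains(current) || common.contains(next)) {
                    // Increment 0 places the bigram at the same position as 'current'.
                    out.add(current + "_" + next + "/0/gram");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<String>(Arrays.asList("the", "of"));
        List<String> words = Arrays.asList("the", "speed", "of", "light");
        for (String token : overlay(words, common)) {
            System.out.println(token);
        }
    }
}
```

With "the" and "of" as common words, every adjacent pair in "the speed of light" touches a common word, so each of the first three unigrams is followed by a bigram. With an empty common set, only the unigrams are emitted.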

reset

public void reset()
           throws IOException
Overrides:
reset in class BufferedTokenStream
Throws:
IOException


Copyright © 2011 Apache Software Foundation. All Rights Reserved.