com.ibm.icu.text
Class DictionaryBasedBreakIterator
All Implemented Interfaces: Cloneable
public class DictionaryBasedBreakIterator extends RuleBasedBreakIterator_Old
A subclass of RuleBasedBreakIterator_Old that adds the ability to use a dictionary
to further subdivide ranges of text beyond what is possible using just the
state-table-based algorithm. This is necessary, for example, to handle
word and line breaking in Thai, which doesn't use spaces between words. The
state-table-based algorithm used by RuleBasedBreakIterator_Old is used to divide
up text as far as possible, and then contiguous ranges of letters are
repeatedly compared against a list of known words (i.e., the dictionary)
to divide them up into words.
DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old,
but adds one more special substitution name: _dictionary_. This substitution
name is used to identify characters in words in the dictionary. The idea is that
if the iterator passes over a chunk of text that includes two or more characters
in a row that are included in _dictionary_, it goes back through that range and
derives additional break positions (if possible) using the dictionary.
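The subdivision step described above can be illustrated with a small, self-contained sketch. It is not ICU's implementation: the class name DictionarySubdivisionSketch, the method subdivide, and the use of a Set<String> as the dictionary are all assumptions made for illustration, and run-together English words stand in for Thai text. Given a range of letters that the state-table pass left unbroken, it looks for a way to cover the whole range with known words and reports the resulting boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of ICU4J.
class DictionarySubdivisionSketch {

    // Returns the boundary offsets after `start` (the last one is `end`) that
    // split text[start, end) into dictionary words, or an empty list when the
    // range cannot be covered completely by dictionary words.
    static List<Integer> subdivide(String text, int start, int end, Set<String> dictionary) {
        int n = end - start;
        // wordStart[i] is where the word ending at relative offset i begins,
        // or -1 when the first i characters cannot be split into words.
        int[] wordStart = new int[n + 1];
        Arrays.fill(wordStart, -1);
        wordStart[0] = 0; // the empty prefix is trivially covered
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                boolean prefixCovered = wordStart[j] >= 0;
                if (prefixCovered && dictionary.contains(text.substring(start + j, start + i))) {
                    wordStart[i] = j; // smallest j wins: the longest candidate word ending at i
                    break;
                }
            }
        }
        List<Integer> breaks = new ArrayList<>();
        if (wordStart[n] < 0) {
            return breaks; // no complete segmentation found
        }
        for (int i = n; i > 0; i = wordStart[i]) {
            breaks.add(0, start + i); // walk back from the end, prepending each boundary
        }
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("the", "cat", "cats", "sat", "at"));
        System.out.println(subdivide("thecatsat", 0, 9, dictionary)); // prints [3, 6, 9]
    }
}
```

A real dictionary-based iterator would also have to decide what to do when no complete segmentation exists; this sketch simply returns no additional breaks in that case.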
DictionaryBasedBreakIterator is also constructed with a dictionary. This overview
describes the dictionary as a filename located with Class.getResource(), whereas the
constructor listed below accepts an InputStream containing the dictionary data. The
dictionary is in a serialized binary format. A very primitive (and slow)
BuildDictionaryFile utility exists for creating dictionary files, but it is not
currently public. Contact us for help.
Nested Class Summary
protected class | DictionaryBasedBreakIterator.Builder - The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator_Old, but extends it with extra logic to handle the DICTIONARY_VAR token.
Method Summary
int | first() - Sets the current iteration position to the beginning of the text.
int | following(int offset) - Sets the current iteration position to the first boundary position after the specified position.
protected int | handleNext() - This is the implementation function for next().
int | last() - Sets the current iteration position to the end of the text.
protected int | lookupCategory(char c) - Looks up a character category for a character.
protected RuleBasedBreakIterator_Old.Builder | makeBuilder() - Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
int | preceding(int offset) - Sets the current iteration position to the last boundary position before the specified position.
int | previous() - Moves the iterator one step backwards.
void | setText(CharacterIterator newText)
void | writeTablesToFile(FileOutputStream file, boolean littleEndian)
Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator_Old
checkOffset, clone, current, debugDumpTables, debugPrintln, equals, first, following, getRuleStatus, getRuleStatusVec, getText, handleNext, handlePrevious, hashCode, isBoundary, last, lookupBackwardState, lookupCategory, lookupState, makeBuilder, next, next, preceding, previous, setText, toString, writeSwappedInt, writeSwappedShort, writeTablesToFile
Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator
clone, current, equals, first, following, getInstanceFromCompiledRules, getRuleStatus, getRuleStatusVec, getText, hashCode, isBoundary, last, next, next, preceding, previous, setText, toString
Methods inherited from class com.ibm.icu.text.BreakIterator
clone, current, first, following, getAvailableLocales, getAvailableULocales, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getText, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, isBoundary, last, next, next, preceding, previous, registerInstance, registerInstance, setText, setText, unregister
DictionaryBasedBreakIterator
public DictionaryBasedBreakIterator(String description,
InputStream dictionaryStream)
throws IOException
Constructs a DictionaryBasedBreakIterator.
Parameters:
description - Same as the description parameter on RuleBasedBreakIterator_Old,
except for the special meaning of DICTIONARY_VAR. This parameter is just
passed through to RuleBasedBreakIterator_Old's constructor.
dictionaryStream - The stream containing the dictionary data.
first
public int first()
Sets the current iteration position to the beginning of the text
(i.e., the CharacterIterator's starting offset).
Overrides: first in class RuleBasedBreakIterator_Old
Returns: The offset of the beginning of the text.
following
public int following(int offset)
Sets the current iteration position to the first boundary position after
the specified position.
Overrides: following in class RuleBasedBreakIterator_Old
Parameters:
offset - The position to begin searching forward from.
Returns: The position of the first boundary after "offset".
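As a rough illustration of how first() and following(int) behave, the sketch below uses the JDK's java.text.BreakIterator word iterator as a stand-in, because it exposes the same navigation methods and runs without a dictionary file. It is not a DictionaryBasedBreakIterator, and the class name FollowingSketch and the sample text are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration using the JDK BreakIterator, not ICU4J.
class FollowingSketch {

    // Returns { first(), following(2), current() } for the text "Hello world".
    static int[] demo() {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText("Hello world");
        int first = words.first();      // beginning of the text
        int after = words.following(2); // first boundary strictly after offset 2
        int current = words.current();  // following() also moved the iteration position
        return new int[] { first, after, current };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [0, 5, 5]
    }
}
```

The point to notice is that following(int) both returns the boundary and repositions the iterator, so a later call to next() continues from that boundary.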
last
public int last()
Sets the current iteration position to the end of the text
(i.e., the CharacterIterator's ending offset).
Overrides: last in class RuleBasedBreakIterator_Old
Returns: The text's past-the-end offset.
preceding
public int preceding(int offset)
Sets the current iteration position to the last boundary position
before the specified position.
Overrides: preceding in class RuleBasedBreakIterator_Old
Parameters:
offset - The position to begin searching from.
Returns: The position of the last boundary before "offset".
previous
public int previous()
Moves the iterator one step backwards.
Overrides: previous in class RuleBasedBreakIterator_Old
Returns: The position of the last boundary before the
current iteration position.
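Backward traversal with last(), previous(), and preceding(int) can be sketched in the same way. The example again uses the JDK's java.text.BreakIterator word iterator as a stand-in rather than DictionaryBasedBreakIterator itself. The class name BackwardIterationSketch and its helper methods are hypothetical, and the loop relies on the BreakIterator convention that previous() returns BreakIterator.DONE once the beginning of the text has been passed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration using the JDK BreakIterator, not ICU4J.
class BackwardIterationSketch {

    // Collects every word boundary from the end of the text back to the start.
    static List<Integer> boundariesBackward(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int pos = words.last(); pos != BreakIterator.DONE; pos = words.previous()) {
            boundaries.add(pos);
        }
        return boundaries;
    }

    // Returns the last boundary strictly before the given offset.
    static int boundaryBefore(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        return words.preceding(offset);
    }

    public static void main(String[] args) {
        System.out.println(boundariesBackward("Hello world")); // prints [11, 6, 5, 0]
        System.out.println(boundaryBefore("Hello world", 8));  // prints 6
    }
}
```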
Copyright (c) 2006 IBM Corporation and others.