com.ibm.icu.text

Class RuleBasedBreakIterator

Implemented Interfaces:
Cloneable
Known Direct Subclasses:
RuleBasedBreakIterator_New, RuleBasedBreakIterator_Old

public class RuleBasedBreakIterator
extends BreakIterator

A subclass of BreakIterator whose behavior is specified using a list of rules.

Field Summary

static int
WORD_IDEO
Tag value for words containing ideographic characters, lower limit
static int
WORD_IDEO_LIMIT
Tag value for words containing ideographic characters, upper limit
static int
WORD_KANA
Tag value for words containing kana characters, lower limit
static int
WORD_KANA_LIMIT
Tag value for words containing kana characters, upper limit
static int
WORD_LETTER
Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit.
static int
WORD_LETTER_LIMIT
Tag value for words containing letters, upper limit
static int
WORD_NONE
Tag value for "words" that do not fit into any of the other categories.
static int
WORD_NONE_LIMIT
Upper bound for tags for uncategorized words.
static int
WORD_NUMBER
Tag value for words that appear to be numbers, lower limit.
static int
WORD_NUMBER_LIMIT
Tag value for words that appear to be numbers, upper limit.

Fields inherited from class com.ibm.icu.text.BreakIterator

DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD

Constructor Summary

RuleBasedBreakIterator()
This default constructor is used when creating derived classes of RuleBasedBreakIterator.
RuleBasedBreakIterator(String description)
Constructs a RuleBasedBreakIterator_Old according to the description provided.

Method Summary

Object
clone()
Clones this iterator.
int
current()
Returns the current iteration position.
boolean
equals(Object that)
Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
int
first()
Sets the current iteration position to the beginning of the text.
int
following(int offset)
Sets the iterator to refer to the first boundary position following the specified position.
static RuleBasedBreakIterator
getInstanceFromCompiledRules(InputStream is)
Get a break iterator based on a set of pre-compiled break rules.
int
getRuleStatus()
Return the status tag from the break rule that determined the most recently returned break position.
int
getRuleStatusVec(int[] fillInArray)
Get the status (tag) values from the break rule(s) that determined the most recently returned break position.
CharacterIterator
getText()
Return a CharacterIterator over the text being analyzed.
int
hashCode()
Compute a hash code for this BreakIterator.
boolean
isBoundary(int offset)
Returns true if the specified position is a boundary position.
int
last()
Sets the current iteration position to the end of the text.
int
next()
Advances the iterator to the next boundary position.
int
next(int n)
Advances the iterator either forward or backward the specified number of steps.
int
preceding(int offset)
Sets the iterator to refer to the last boundary position before the specified position.
int
previous()
Advances the iterator backwards, to the last boundary preceding this one.
void
setText(CharacterIterator newText)
Set the iterator to analyze a new piece of text.
String
toString()
Returns the description used to create this iterator.

Methods inherited from class com.ibm.icu.text.BreakIterator

clone, current, first, following, getAvailableLocales, getAvailableULocales, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getText, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, isBoundary, last, next, next, preceding, previous, registerInstance, registerInstance, setText, setText, unregister

Field Details

WORD_IDEO

public static final int WORD_IDEO
Tag value for words containing ideographic characters, lower limit
Field Value:
400

WORD_IDEO_LIMIT

public static final int WORD_IDEO_LIMIT
Tag value for words containing ideographic characters, upper limit
Field Value:
500

WORD_KANA

public static final int WORD_KANA
Tag value for words containing kana characters, lower limit
Field Value:
300

WORD_KANA_LIMIT

public static final int WORD_KANA_LIMIT
Tag value for words containing kana characters, upper limit
Field Value:
400

WORD_LETTER

public static final int WORD_LETTER
Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit.
Field Value:
200

WORD_LETTER_LIMIT

public static final int WORD_LETTER_LIMIT
Tag value for words containing letters, upper limit
Field Value:
300

WORD_NONE

public static final int WORD_NONE
Tag value for "words" that do not fit into any of the other categories. Includes spaces and most punctuation.
Field Value:
0

WORD_NONE_LIMIT

public static final int WORD_NONE_LIMIT
Upper bound for tags for uncategorized words.
Field Value:
100

WORD_NUMBER

public static final int WORD_NUMBER
Tag value for words that appear to be numbers, lower limit.
Field Value:
100

WORD_NUMBER_LIMIT

public static final int WORD_NUMBER_LIMIT
Tag value for words that appear to be numbers, upper limit.
Field Value:
200

Constructor Details

RuleBasedBreakIterator

protected RuleBasedBreakIterator()
This default constructor is used when creating derived classes of RuleBasedBreakIterator. Not intended for use by normal clients of break iterators.

RuleBasedBreakIterator

public RuleBasedBreakIterator(String description)
Constructs a RuleBasedBreakIterator_Old according to the description provided. If the description is malformed, throws an IllegalArgumentException. Normally, instead of constructing a RuleBasedBreakIterator_Old directly, you'll use the factory methods on BreakIterator to create one indirectly from a description in the framework's resource files. Use this constructor when you want special behavior not provided by the built-in iterators.

Method Details

clone

public Object clone()
Clones this iterator.
Overrides:
clone in class BreakIterator
Returns:
A newly-constructed RuleBasedBreakIterator with the same behavior as this one.

current

public int current()
Returns the current iteration position.
Overrides:
current in class BreakIterator
Returns:
The current iteration position.

equals

public boolean equals(Object that)
Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
Overrides:
first in class BreakIterator
Returns:
The offset of the beginning of the text.

following

public int following(int offset)
Sets the iterator to refer to the first boundary position following the specified position.
Overrides:
following in class BreakIterator
Parameters:
offset - The position from which to begin searching for a break position.
Returns:
The position of the first break after the specified offset.
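
As a rough illustration of the contract above, the sketch below uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that its following(int) behaves like the method described here. The class name FollowingDemo and the helper followingBoundary are hypothetical; with ICU4J on the classpath, the same calls would be made on a com.ibm.icu.text.BreakIterator.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Stand-in sketch: java.text.BreakIterator is used here in place of the ICU
// class (an assumption made so the example runs on a plain JDK).
final class FollowingDemo {
    // Returns the first word boundary strictly after the given offset.
    static int followingBoundary(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        return words.following(offset);
    }

    public static void main(String[] args) {
        String text = "Hello world";
        // Offset 5 is itself a boundary, so the search moves on to the next one.
        System.out.println(followingBoundary(text, 0));
        System.out.println(followingBoundary(text, 5));
    }
}
```

Note that an offset that is already a boundary is not returned: the search is for the first boundary after it.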

getInstanceFromCompiledRules

public static RuleBasedBreakIterator getInstanceFromCompiledRules(InputStream is)
            throws IOException
Get a break iterator based on a set of pre-compiled break rules.
Parameters:
is - An input stream that supplies the compiled rule data. The format of the rule data on the stream is that of a rule data file produced by the ICU4C tool "genbrk".
Returns:
A RuleBasedBreakIterator based on the supplied break rules.

getRuleStatus

public int getRuleStatus()
Return the status tag from the break rule that determined the most recently returned break position. The values appear in the rule source within brackets, {123}, for example. For rules that do not specify a status, a default value of 0 is returned. If more than one rule applies, the numerically largest of the possible status values is returned.

The values used by the standard ICU break rules are defined as constants in this class, and allow distinguishing between words that contain alphabetic letters, "words" that appear to be numbers, punctuation and spaces, words containing ideographic characters, and more. Call getRuleStatus after obtaining a boundary position from next(), previous(), or any other break iterator function that returns a boundary position.

Returns:
the status from the break rule that determined the most recently returned break position.
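
One way to interpret a returned status is to test it against the lower and upper limits listed under Field Details. The sketch below mirrors those constant values locally so that it runs without ICU4J; it assumes each range is half-open (lower limit inclusive, upper limit exclusive), which the coinciding limit values suggest but this page does not state outright. The class name WordStatusCategories and the method categoryOf are hypothetical.

```java
// Sketch: classify a word-break status tag by range. The constants are local
// copies of the values documented for RuleBasedBreakIterator; real code would
// reference the class's own fields instead.
final class WordStatusCategories {
    static final int WORD_NONE = 0,     WORD_NONE_LIMIT = 100;
    static final int WORD_NUMBER = 100, WORD_NUMBER_LIMIT = 200;
    static final int WORD_LETTER = 200, WORD_LETTER_LIMIT = 300;
    static final int WORD_KANA = 300,   WORD_KANA_LIMIT = 400;
    static final int WORD_IDEO = 400,   WORD_IDEO_LIMIT = 500;

    // Assumption: ranges are [lower, upper), so 200 counts as a letter word.
    static String categoryOf(int status) {
        if (status >= WORD_NONE && status < WORD_NONE_LIMIT) return "NONE";
        if (status >= WORD_NUMBER && status < WORD_NUMBER_LIMIT) return "NUMBER";
        if (status >= WORD_LETTER && status < WORD_LETTER_LIMIT) return "LETTER";
        if (status >= WORD_KANA && status < WORD_KANA_LIMIT) return "KANA";
        if (status >= WORD_IDEO && status < WORD_IDEO_LIMIT) return "IDEO";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        // In real use the argument would come from getRuleStatus().
        System.out.println(categoryOf(0));
        System.out.println(categoryOf(250));
    }
}
```

Testing against ranges rather than exact values keeps the check valid for custom rules that use other tag values inside the same range.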

getRuleStatusVec

public int getRuleStatusVec(int[] fillInArray)
Get the status (tag) values from the break rule(s) that determined the most recently returned break position. The values appear in the rule source within brackets, {123}, for example. The default status value for rules that do not explicitly provide one is zero.

The values used by the standard ICU rules are defined as constants in this class.

If the size of the output array is insufficient to hold the data, the output will be truncated to the available length. No exception will be thrown.

Parameters:
fillInArray - an array to be filled in with the status values.
Returns:
The number of rule status values from rules that determined the most recent boundary returned by the break iterator. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.
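
Because the return value reports the total number of available values even when the array is too small, a caller can detect truncation and retry with a larger array. The sketch below models that calling pattern only. fillStatusValues is a hypothetical stand-in for the method described above, written from the documented contract rather than taken from the ICU implementation, and the available array stands in for the status values attached to the most recent boundary.

```java
import java.util.Arrays;

final class StatusVecSketch {
    // Hypothetical model of the documented contract: copy as many values as
    // fit, never throw, and return the total number that were available.
    static int fillStatusValues(int[] available, int[] fillInArray) {
        int copied = Math.min(available.length, fillInArray.length);
        System.arraycopy(available, 0, fillInArray, 0, copied);
        return available.length;
    }

    // Caller-side pattern: try a small buffer, grow it if the result was truncated.
    static int[] allStatusValues(int[] available) {
        int[] buffer = new int[1];
        int total = fillStatusValues(available, buffer);
        if (total > buffer.length) {
            buffer = new int[total];
            fillStatusValues(available, buffer);
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) {
        int[] values = allStatusValues(new int[] {100, 200});
        System.out.println(Arrays.toString(values));
    }
}
```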

getText

public CharacterIterator getText()
Return a CharacterIterator over the text being analyzed. This version of this method returns the actual CharacterIterator we're using internally. Changing the state of this iterator can have undefined consequences. If you need to change it, clone it first.
Overrides:
getText in class BreakIterator
Returns:
An iterator over the text being analyzed.
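
The advice to clone before changing the returned iterator can be sketched as follows. The example uses the JDK's java.text.BreakIterator as a stand-in, on the assumption that its getText() exposes the internal CharacterIterator in the same way as described here. The class name GetTextCloneDemo and the helper inspect are hypothetical.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

// Stand-in sketch using the JDK break iterator (an assumption made so the
// example runs without ICU4J).
final class GetTextCloneDemo {
    // Returns {shared index before, shared index after, clone index}.
    static int[] inspect(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        words.first();

        CharacterIterator shared = words.getText();
        int before = shared.getIndex();

        // Work on a copy so the break iterator's own state is left alone.
        CharacterIterator copy = (CharacterIterator) shared.clone();
        copy.setIndex(copy.getEndIndex());

        return new int[] { before, shared.getIndex(), copy.getIndex() };
    }

    public static void main(String[] args) {
        int[] r = inspect("Hello world");
        System.out.println(r[0] == r[1]); // moving the clone left the shared iterator untouched
    }
}
```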

hashCode

public int hashCode()
Compute a hash code for this BreakIterator.
Returns:
A hash code

isBoundary

public boolean isBoundary(int offset)
Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".
Overrides:
isBoundary in class BreakIterator
Parameters:
offset - the offset to check.
Returns:
True if "offset" is a boundary position.
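
The side effect on the iterator position is easy to overlook, so here is a small sketch of it. It uses the JDK's java.text.BreakIterator as a stand-in, assuming its isBoundary(int) follows the same contract as the method described here. The class name IsBoundaryDemo and the helper probe are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Stand-in sketch using the JDK break iterator (an assumption made so the
// example runs without ICU4J).
final class IsBoundaryDemo {
    // Returns {1 if offset is a boundary, else 0; iterator position afterwards}.
    static int[] probe(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        boolean boundary = words.isBoundary(offset);
        return new int[] { boundary ? 1 : 0, words.current() };
    }

    public static void main(String[] args) {
        // Offset 2 falls inside "Hello", so it is not a boundary; the iterator
        // is left at the first boundary at or after it.
        int[] r = probe("Hello world", 2);
        System.out.println(r[0] + " " + r[1]);
    }
}
```

Code that calls isBoundary in the middle of an iteration should therefore re-read current() before continuing.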

last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
Overrides:
last in class BreakIterator
Returns:
The text's past-the-end offset.

next

public int next()
Advances the iterator to the next boundary position.
Overrides:
next in class BreakIterator
Returns:
The position of the first boundary after this one.

next

public int next(int n)
Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().
Overrides:
next in class BreakIterator
Parameters:
n - The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).
Returns:
The character offset of the boundary position n boundaries away from the current one.
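
The stated equivalence between next(n) and repeated calls to next() or previous() can be checked with a short sketch. It uses the JDK's java.text.BreakIterator as a stand-in, assuming its next(int) matches the behavior described here. The class name NextNDemo and the helpers jump and step are hypothetical. Positive n starts from the first boundary and negative n from the last.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Stand-in sketch using the JDK break iterator (an assumption made so the
// example runs without ICU4J).
final class NextNDemo {
    private static BreakIterator wordsAt(String text, boolean fromEnd) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        if (fromEnd) {
            words.last();
        } else {
            words.first();
        }
        return words;
    }

    // Moves n boundaries in a single call.
    static int jump(String text, int n) {
        return wordsAt(text, n < 0).next(n);
    }

    // Moves n boundaries one step at a time.
    static int step(String text, int n) {
        BreakIterator words = wordsAt(text, n < 0);
        int pos = words.current();
        for (int i = 0; i < Math.abs(n); i++) {
            pos = (n < 0) ? words.previous() : words.next();
        }
        return pos;
    }

    public static void main(String[] args) {
        String text = "Hello world";
        System.out.println(jump(text, 2) == step(text, 2));
        System.out.println(jump(text, -2) == step(text, -2));
    }
}
```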

preceding

public int preceding(int offset)
Sets the iterator to refer to the last boundary position before the specified position.
Overrides:
preceding in class BreakIterator
Parameters:
offset - The position to begin searching for a break from.
Returns:
The position of the last boundary before the starting position.

previous

public int previous()
Advances the iterator backwards, to the last boundary preceding this one.
Overrides:
previous in class BreakIterator
Returns:
The position of the last boundary position preceding this one.
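
A typical backward walk starts at last() and calls previous() until the inherited DONE value is returned. The sketch below uses the JDK's java.text.BreakIterator as a stand-in, assuming it follows the same contract as the methods described here. The class name BackwardWalkDemo and the helper boundariesFromEnd are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in sketch using the JDK break iterator (an assumption made so the
// example runs without ICU4J).
final class BackwardWalkDemo {
    // Collects every word boundary, starting at the end of the text.
    static List<Integer> boundariesFromEnd(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = words.last(); pos != BreakIterator.DONE; pos = words.previous()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundariesFromEnd("Hello world"));
    }
}
```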

setText

public void setText(CharacterIterator newText)
Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.
Overrides:
setText in class BreakIterator
Parameters:
newText - An iterator over the text to analyze.
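
The reset of the iteration position can be seen in a short sketch. It uses the JDK's java.text.BreakIterator as a stand-in, assuming its setText(CharacterIterator) resets the position in the same way as described here. The class name SetTextResetDemo and the helper positionAfterSetText are hypothetical.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

// Stand-in sketch using the JDK break iterator (an assumption made so the
// example runs without ICU4J).
final class SetTextResetDemo {
    // Moves to the end of the first text, swaps in a second text, and
    // reports where the iterator ends up.
    static int positionAfterSetText(String first, String second) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(first);
        words.last();
        words.setText(new StringCharacterIterator(second));
        return words.current();
    }

    public static void main(String[] args) {
        System.out.println(positionAfterSetText("Hello world", "Another text"));
    }
}
```

Reusing one iterator across many strings in this way avoids creating a new iterator for each piece of text.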

toString

public String toString()
Returns the description used to create this iterator.

Copyright (c) 2006 IBM Corporation and others.