RuleBasedCollator is a concrete subclass of Collator. It allows
customization of the Collator via user-specified rule sets.
RuleBasedCollator is designed to be fully compliant with the
Unicode Collation Algorithm (UCA) and conforms to ISO 14651.
Users are strongly encouraged to read the User's Guide for more
information about the collation service before using this class.
Create a RuleBasedCollator from a locale by calling the
getInstance(Locale) factory method in the base class Collator.
Collator.getInstance(Locale) creates a RuleBasedCollator object
based on the collation rules defined by the argument locale. If a
customized collation ordering or attributes is required, use the
RuleBasedCollator(String) constructor with the appropriate
rules. The customized RuleBasedCollator will base its ordering on
UCA, while re-adjusting the attributes and orders of the characters
in the specified rule accordingly.
RuleBasedCollator provides correct collation orders for most
locales supported in ICU. If specific data for a locale is not
available, the order eventually falls back to the UCA collation order.
For information about the collation rule syntax and details
about customization, please refer to the
Collation customization section of the user's guide.
Note that there are some differences between
the Collation rule syntax used in Java and ICU4J:
- According to the JDK documentation:
Modifier '!' : Turns on Thai/Lao vowel-consonant swapping. If this rule
is in force when a Thai vowel of the range \U0E40-\U0E44 precedes a
Thai consonant of the range \U0E01-\U0E2E OR a Lao vowel of the
range \U0EC0-\U0EC4 precedes a Lao consonant of the range
\U0E81-\U0EAE then the
vowel is placed after the consonant for collation purposes.
If a rule is without the modifier '!', the Thai/Lao vowel-consonant
swapping is not turned on.
ICU4J's RuleBasedCollator does not support turning off the Thai/Lao
vowel-consonant swapping, since the UCA clearly states that it has to be
supported to ensure a correct sorting order. If a '!' is encountered, it is
ignored.
- As mentioned in the documentation of the base class Collator,
compatibility decomposition mode is not supported.
Examples
Creating Customized RuleBasedCollators:
String simple = "& a < b < c < d";
RuleBasedCollator simpleCollator = new RuleBasedCollator(simple);
String norwegian = "& a , A < b , B < c , C < d , D < e , E "
+ "< f , F < g , G < h , H < i , I < j , "
+ "J < k , K < l , L < m , M < n , N < "
+ "o , O < p , P < q , Q < r , R < s , S < "
+ "t , T < u , U < v , V < w , W < x , X "
+ "< y , Y < z , Z < \u00E5 = a\u030A "
+ ", \u00C5 = A\u030A ; aa , AA < \u00E6 "
+ ", \u00C6 < \u00F8 , \u00D8";
RuleBasedCollator norwegianCollator = new RuleBasedCollator(norwegian);
Concatenating rules to combine Collators:
// Create an en_US Collator object
RuleBasedCollator en_USCollator = (RuleBasedCollator)
Collator.getInstance(new Locale("en", "US", ""));
// Create a da_DK Collator object
RuleBasedCollator da_DKCollator = (RuleBasedCollator)
Collator.getInstance(new Locale("da", "DK", ""));
// Combine the two
// First, get the collation rules from en_USCollator
String en_USRules = en_USCollator.getRules();
// Second, get the collation rules from da_DKCollator
String da_DKRules = da_DKCollator.getRules();
RuleBasedCollator newCollator =
new RuleBasedCollator(en_USRules + da_DKRules);
// newCollator has the combined rules
Making changes to an existing RuleBasedCollator to create a new
Collator object, by appending changes to the existing rule:
// Create a new Collator object with additional rules
String addRules = "& C < ch, cH, Ch, CH";
RuleBasedCollator myCollator =
new RuleBasedCollator(en_USCollator.getRules() + addRules);
// myCollator contains the new rules
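The append-to-existing-rules pattern above can be tried without ICU4J on the classpath. The following is only a sketch under that assumption: it substitutes the JDK's java.text.RuleBasedCollator, whose getRules() method and String constructor are used the same way here, and it is not a statement about ICU4J internals. The class name AppendRulesSketch and the helper withCh are hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class AppendRulesSketch {
    // Hypothetical helper: extends the en_US rules so that "ch" sorts as one unit after "c".
    // Assumption: the JDK collator (java.text) stands in for the ICU4J class.
    static RuleBasedCollator withCh() {
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.US);
        String addRules = "& C < ch, cH, Ch, CH";
        try {
            // Concatenate the rule strings, not the collator object itself.
            return new RuleBasedCollator(base.getRules() + addRules);
        } catch (ParseException e) {
            throw new IllegalStateException("rule string could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator tailored = withCh();
        // With the added rule, "ch" is ordered after every string starting with plain "c".
        System.out.println(tailored.compare("ch", "cz") > 0);
        System.out.println(tailored.compare("ch", "d") < 0);
    }
}
```

Note that the rules are obtained with getRules() before concatenation; passing the collator object itself to string concatenation would append its toString() value rather than its rules.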
How to change the order of non-spacing accents:
// old rule with main accents
String oldRules = "= \u0301 ; \u0300 ; \u0302 ; \u0308 "
+ "; \u0327 ; \u0303 ; \u0304 ; \u0305 "
+ "; \u0306 ; \u0307 ; \u0309 ; \u030A "
+ "; \u030B ; \u030C ; \u030D ; \u030E "
+ "; \u030F ; \u0310 ; \u0311 ; \u0312 "
+ "< a , A ; ae, AE ; \u00e6 , \u00c6 "
+ "< b , B < c, C < e, E & C < d , D";
// change the order of accent characters
String addOn = "& \u0300 ; \u0308 ; \u0302";
RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);
Putting in a new primary ordering before the default setting,
e.g. sort English characters before or after Japanese characters in the Japanese
Collator:
// get en_US Collator rules
RuleBasedCollator en_USCollator
= (RuleBasedCollator)Collator.getInstance(Locale.US);
// add a few Japanese characters to sort before English characters
// suppose the last character before the first base letter 'a' in
// the English collation rule is \u2212
String jaString = "& \u2212 < \u3041, \u3042 < \u3043, "
+ "\u3044";
RuleBasedCollator myJapaneseCollator
= new RuleBasedCollator(en_USCollator.getRules() + jaString);
This class is not subclassable.
clone
public Object clone()
throws CloneNotSupportedException
Clones the RuleBasedCollator
- clone in interface Collator
- a new instance of this RuleBasedCollator object
compare
public int compare(String source,
String target)
Compares the source text String to the target text String according to
the collation rules, strength and decomposition mode for this
RuleBasedCollator.
Returns an integer less than,
equal to or greater than zero depending on whether the source String is
less than, equal to or greater than the target String. See the Collator
class description for an example of use.
General recommendation:
If comparisons are to be done on the same String multiple times, it is
more efficient to generate CollationKeys for the Strings and use
CollationKey.compareTo(CollationKey) for the comparisons.
If speed is critical and object instantiation is to be
reduced, further optimization may be achieved by generating a simpler
key of the form RawCollationKey and reusing this RawCollationKey
object with the method RuleBasedCollator.getRawCollationKey. The internal
byte representation can be accessed directly via RawCollationKey and
stored for future use. Like CollationKey, RawCollationKey provides a
method RawCollationKey.compareTo for key comparisons.
If each String is compared only once, using the method
RuleBasedCollator.compare(String, String) performs better.
- compare in interface Collator
source
- the source text String.
target
- the target text String.
- Returns an integer value. Value is less than zero if source is
less than target, value is zero if source and target are equal,
value is greater than zero if source is greater than target.
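As a minimal illustration of the return-value contract, the sketch below reduces the result of compare to its sign, since only the sign is meaningful. It assumes the JDK's java.text.Collator as a stand-in so that it runs without ICU4J; the class name CompareSketch and the helper sign are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class CompareSketch {
    // Assumption: a JDK collator for en_US stands in for an ICU4J RuleBasedCollator.
    private static final Collator COLLATOR = Collator.getInstance(Locale.US);

    // Hypothetical helper: reduces the comparison result to -1, 0 or 1.
    static int sign(String source, String target) {
        return Integer.signum(COLLATOR.compare(source, target));
    }

    public static void main(String[] args) {
        System.out.println(sign("apple", "banana")); // source sorts before target
        System.out.println(sign("banana", "apple")); // source sorts after target
        System.out.println(sign("apple", "apple"));  // source and target are equal
    }
}
```

Callers should test the result against zero rather than against specific values such as -1 or 1.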
equals
public boolean equals(Object obj)
Compares the equality of two RuleBasedCollator objects.
RuleBasedCollator objects are equal if they have the same collation
rules and the same attributes.
obj
- the RuleBasedCollator to be compared to.
- true if this RuleBasedCollator has exactly the same
collation behaviour as obj, false otherwise.
getCollationElementIterator
public CollationElementIterator getCollationElementIterator(CharacterIterator source)
Return a CollationElementIterator for the given CharacterIterator.
The source iterator's integrity will be preserved since a new copy
will be created for use.
getCollationElementIterator
public CollationElementIterator getCollationElementIterator(String source)
Return a CollationElementIterator for the given String.
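A short sketch of how such an iterator might be walked: it collects the primary weight of each collation element in a String. It assumes the JDK's java.text.CollationElementIterator as a stand-in, which offers next(), NULLORDER and the static primaryOrder(int) in the same spirit; the class name ElementIteratorSketch and the helper primaryWeights are hypothetical.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ElementIteratorSketch {
    // Hypothetical helper: returns the primary weight of every collation element in text.
    // Assumption: the JDK collator for en_US stands in for the ICU4J class.
    static List<Integer> primaryWeights(String text) {
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.US);
        CollationElementIterator it = collator.getCollationElementIterator(text);
        List<Integer> primaries = new ArrayList<>();
        int element;
        // next() returns NULLORDER once the end of the text has been reached.
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            primaries.add(CollationElementIterator.primaryOrder(element));
        }
        return primaries;
    }

    public static void main(String[] args) {
        // The exact weights are implementation-specific; only their relative order matters.
        System.out.println(primaryWeights("abc"));
    }
}
```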
getCollationElementIterator
public CollationElementIterator getCollationElementIterator(UCharacterIterator source)
Return a CollationElementIterator for the given UCharacterIterator.
The source iterator's integrity will be preserved since a new copy
will be created for use.
getCollationKey
public CollationKey getCollationKey(String source)
Get a Collation key for the argument String source from this
RuleBasedCollator.
General recommendation:
If comparisons are to be done on the same String multiple times, it is
more efficient to generate CollationKeys for the Strings and use
CollationKey.compareTo(CollationKey) for the comparisons.
If each String is compared only once, using the method
RuleBasedCollator.compare(String, String) performs better.
See the class documentation for an explanation about CollationKeys.
- getCollationKey in interface Collator
source
- the text String to be transformed into a collation key.
- the CollationKey for the given String based on this
RuleBasedCollator's collation rules. If the source String is
null, a null CollationKey is returned.
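The recommendation above can be sketched as follows: each String is transformed into a key once, and the sort then compares keys only. The sketch assumes the JDK's java.text.CollationKey as a stand-in so that it runs without ICU4J; the class name CollationKeySketch and the helper sortWithKeys are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationKeySketch {
    // Hypothetical helper: sorts words by precomputed collation keys.
    // Assumption: the JDK collator for en_US stands in for the ICU4J class.
    static String[] sortWithKeys(String... words) {
        Collator collator = Collator.getInstance(Locale.US);
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]); // one transformation per word
        }
        Arrays.sort(keys); // CollationKey is Comparable, so the sort compares keys only
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortWithKeys("pear", "Apple", "banana")));
    }
}
```

Keys are only comparable when they were produced by the same collator with the same settings.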
getContractionsAndExpansions
public void getContractionsAndExpansions(UnicodeSet contractions,
UnicodeSet expansions,
boolean addPrefixes)
throws Exception
Gets Unicode sets containing the contractions and/or expansions of a collator.
contractions
- if not null, set to contain contractions
expansions
- if not null, set to contain expansions
addPrefixes
- add the prefix contextual elements to contractions
getNumericCollation
public boolean getNumericCollation()
Method to retrieve the numeric collation value.
When numeric collation is turned on, this Collator generates a collation
key for the numeric value of substrings of digits. This is a way to get
'100' to sort AFTER '2'.
- true if numeric collation is turned on, false otherwise
getRawCollationKey
public RawCollationKey getRawCollationKey(String source,
RawCollationKey key)
Gets the simpler form of a CollationKey for the String source following
the rules of this Collator and stores the result into the user provided
argument key.
If key has an internal byte array whose length is too small for the
result, the internal byte array will be grown to the exact required
size.
- getRawCollationKey in interface Collator
source
- the text String to be transformed into a RawCollationKey
key
- output RawCollationKey to store results
- If key is null, a new instance of RawCollationKey will be
created and returned, otherwise the user provided key will be
returned.
getRules
public String getRules()
Gets the collation rules for this RuleBasedCollator.
Equivalent to String getRules(RuleOption.FULL_RULES).
- returns the collation rules
getRules
public String getRules(boolean fullrules)
Returns the current rules. The argument determines whether the full rules
(UCA + tailoring) are returned or just the tailoring.
fullrules
- true if the rules that define the full collation order are
required, false to return only the tailored rules
- the current rules that define this Collator.
getTailoredSet
public UnicodeSet getTailoredSet()
Gets a UnicodeSet that contains all the characters and sequences
tailored in this collator.
- getTailoredSet in interface Collator
- a UnicodeSet object containing all the
code points and sequences that may sort differently than
in the UCA.
getUCAVersion
public VersionInfo getUCAVersion()
Get the UCA version of this collator object.
- getUCAVersion in interface Collator
- the version object associated with this collator
getVariableTop
public int getVariableTop()
Gets the variable top value of a Collator.
Lower 16 bits are undefined and should be ignored.
- getVariableTop in interface Collator
- the variable top value of a Collator.
getVersion
public VersionInfo getVersion()
Get the version of this collator object.
- getVersion in interface Collator
- the version object associated with this collator
hashCode
public int hashCode()
Generates a hash code for this RuleBasedCollator.
- the hash code for this Collator
isAlternateHandlingShifted
public boolean isAlternateHandlingShifted()
Checks if the alternate handling behaviour is the UCA defined SHIFTED or
NON_IGNORABLE.
If the return value is true, the alternate handling attribute for the
Collator is SHIFTED. If the return value is false, the
alternate handling attribute for the Collator is NON_IGNORABLE.
See setAlternateHandlingShifted(boolean) for more details.
isCaseLevel
public boolean isCaseLevel()
Checks if case level is set to true.
See setCaseLevel(boolean) for details.
isFrenchCollation
public boolean isFrenchCollation()
Checks if French Collation is set to true.
See setFrenchCollation(boolean) for details.
- true if French Collation is set to true, false otherwise
isHiraganaQuaternary
public boolean isHiraganaQuaternary()
Checks if the Hiragana Quaternary mode is set on.
See setHiraganaQuaternary(boolean) for more details.
- flag true if Hiragana Quaternary mode is on, false otherwise
isLowerCaseFirst
public boolean isLowerCaseFirst()
Return true if a lowercase character is sorted before the corresponding uppercase character.
See setLowerCaseFirst(boolean) for details.
- true if lower cased characters are sorted before upper cased
characters, false otherwise
isUpperCaseFirst
public boolean isUpperCaseFirst()
Return true if an uppercase character is sorted before the corresponding lowercase character.
See setUpperCaseFirst(boolean) for details.
- true if upper cased characters are sorted before lower cased
characters, false otherwise
setAlternateHandlingDefault
public void setAlternateHandlingDefault()
Sets the alternate handling mode to the initial mode set during
construction of the RuleBasedCollator.
See setAlternateHandlingShifted(boolean) for more details.
setAlternateHandlingShifted
public void setAlternateHandlingShifted(boolean shifted)
Sets the alternate handling for QUATERNARY strength to be either
shifted or non-ignorable.
See the UCA definition on
Alternate Weighting.
This attribute is only effective when QUATERNARY strength is set.
The default value for this mode is false, corresponding to the
NON_IGNORABLE mode in UCA. In the NON_IGNORABLE mode, the
RuleBasedCollator treats all code points with non-ignorable
primary weights in the same way.
If the mode is set to true, the behaviour corresponds to SHIFTED as defined
in UCA: code points with PRIMARY orders that are equal to or
below the variable top value are ignored in PRIMARY order and
moved to the QUATERNARY order.
shifted
- true if SHIFTED behaviour for alternate handling is
desired, false for the NON_IGNORABLE behaviour.
setCaseFirstDefault
public final void setCaseFirstDefault()
Sets the case first mode to the initial mode set during
construction of the RuleBasedCollator.
See setUpperCaseFirst(boolean) and setLowerCaseFirst(boolean) for more
details.
setCaseLevel
public void setCaseLevel(boolean flag)
When case level is set to true, an additional weight is formed
between the SECONDARY and TERTIARY weight, known as the case level.
The case level is used to distinguish large and small Japanese Kana
characters. Case level can also be used in other situations,
for example to distinguish certain Pinyin characters.
The default value is false, which means the case level is not generated.
The contents of the case level are affected by the case first
mode. A simple way to ignore accent differences in a string is to set
the strength to PRIMARY and enable case level.
See the section on
case level for more information.
flag
- true if case level sorting is required, false otherwise
setCaseLevelDefault
public void setCaseLevelDefault()
Sets the case level mode to the initial mode set during
construction of the RuleBasedCollator.
See setCaseLevel(boolean) for more details.
setDecompositionDefault
public void setDecompositionDefault()
Sets the decomposition mode to the initial mode set during construction
of the RuleBasedCollator.
See setDecomposition(int) for more details.
setFrenchCollation
public void setFrenchCollation(boolean flag)
Sets the mode for the direction of SECONDARY weights to be used in
French collation.
The default value is false, which treats SECONDARY weights in the order
they appear.
If set to true, the SECONDARY weights will be sorted backwards.
See the section on
French collation for more information.
flag
- true to set the French collation on, false to set it off
setFrenchCollationDefault
public void setFrenchCollationDefault()
Sets the French collation mode to the initial mode set during
construction of the RuleBasedCollator.
See setFrenchCollation(boolean) for more details.
setHiraganaQuaternary
public void setHiraganaQuaternary(boolean flag)
Sets the Hiragana Quaternary mode to be on or off.
When the Hiragana Quaternary mode is turned on, the collator
positions Hiragana characters before all non-ignorable characters in
QUATERNARY strength. This is to produce a correct JIS collation order,
distinguishing between Katakana and Hiragana characters.
flag
- true if Hiragana Quaternary mode is to be on, false
otherwise
setHiraganaQuaternaryDefault
public void setHiraganaQuaternaryDefault()
Sets the Hiragana Quaternary mode to the initial mode set during
construction of the RuleBasedCollator.
See setHiraganaQuaternary(boolean) for more details.
setLowerCaseFirst
public void setLowerCaseFirst(boolean lowerfirst)
Sets the orders of lower cased characters to sort before upper cased
characters, in strength TERTIARY. The default
mode is false.
If true is set, the RuleBasedCollator will sort lower cased characters
before the upper cased ones.
Otherwise, if false is set, the RuleBasedCollator will ignore case
preferences.
lowerfirst
- true for sorting lower cased characters before
upper cased characters, false to ignore case
preferences.
setNumericCollation
public void setNumericCollation(boolean flag)
When numeric collation is turned on, this Collator generates a collation
key for the numeric value of substrings of digits. This is a way to get
'100' to sort AFTER '2'.
flag
- true to turn numeric collation on and false to turn it off
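To make the intended ordering concrete, the sketch below implements a small stand-alone comparator that orders runs of ASCII digits by numeric value. It only illustrates the effect described above; it is not how the collator implements numeric collation, and it ignores all other collation rules. The class name NumericOrderSketch and the constant NUMERIC_AWARE are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NumericOrderSketch {
    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Hypothetical comparator: digit runs compare by numeric value, other chars by code unit.
    static final Comparator<String> NUMERIC_AWARE = (a, b) -> {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isAsciiDigit(ca) && isAsciiDigit(cb)) {
                int ie = i, je = j;
                while (ie < a.length() && isAsciiDigit(a.charAt(ie))) ie++;
                while (je < b.length() && isAsciiDigit(b.charAt(je))) je++;
                // Compare the two digit runs as numbers, so that 2 < 100.
                int c = new BigInteger(a.substring(i, ie))
                        .compareTo(new BigInteger(b.substring(j, je)));
                if (c != 0) return c;
                i = ie;
                j = je;
            } else {
                if (ca != cb) return Character.compare(ca, cb);
                i++;
                j++;
            }
        }
        // The string with leftover characters sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("file100", "file2", "file10"));
        names.sort(NUMERIC_AWARE);
        System.out.println(names); // [file2, file10, file100]
    }
}
```

A plain code-unit comparison would place "file100" before "file2", which is the ordering that numeric collation is meant to avoid.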
setNumericCollationDefault
public void setNumericCollationDefault()
Method to set numeric collation to its default value.
When numeric collation is turned on, this Collator generates a collation
key for the numeric value of substrings of digits. This is a way to get
'100' to sort AFTER '2'.
setStrength
public void setStrength(int newStrength)
Sets this Collator's strength property. The strength property
determines the minimum level of difference considered significant
during comparison.
See the Collator class description for an example of use.
- setStrength in interface Collator
newStrength
- the new strength value.
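The effect of the strength property can be sketched with a case difference, which is ignored at PRIMARY strength and significant at TERTIARY strength. The sketch assumes the JDK's java.text.Collator as a stand-in, whose setStrength(int) and PRIMARY/TERTIARY constants play the same role; the class name StrengthSketch and the helper compareAt are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthSketch {
    // Hypothetical helper: compares a and b with a fresh en_US collator at the given strength.
    // Assumption: the JDK collator stands in for an ICU4J RuleBasedCollator.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY: only base letters matter, so a case difference is not significant.
        System.out.println(compareAt(Collator.PRIMARY, "abc", "ABC") == 0);
        // TERTIARY: case differences are significant, so the strings are no longer equal.
        System.out.println(compareAt(Collator.TERTIARY, "abc", "ABC") != 0);
    }
}
```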
setStrengthDefault
public void setStrengthDefault()
Sets the collation strength to the initial mode set during the
construction of the RuleBasedCollator.
See setStrength(int) for more details.
setUpperCaseFirst
public void setUpperCaseFirst(boolean upperfirst)
Sets whether uppercase characters sort before lowercase
characters or vice versa, in strength TERTIARY. The default
mode is false, and so lowercase characters sort before uppercase
characters.
If true, sort upper case characters first.
upperfirst
- true to sort uppercase characters before
lowercase characters, false to sort lowercase
characters before uppercase characters
setVariableTop
public int setVariableTop(String varTop)
Variable top is a two-byte primary value that causes all code points
with primary values less than or equal to the variable top to be
shifted when alternate handling is set to SHIFTED.
Sets the variable top to the collation element value of the supplied string.
- setVariableTop in interface Collator
varTop
- one or more (if contraction) characters to which the
variable top should be set
- an int value containing the value of the variable top in the upper 16
bits. Lower 16 bits are undefined.
setVariableTop
public void setVariableTop(int varTop)
Sets the variable top to a collation element value supplied.
Variable top is set to the upper 16 bits.
Lower 16 bits are ignored.
- setVariableTop in interface Collator
varTop
- Collation element value, as returned by setVariableTop or
getVariableTop