org.apache.lucene.analysis.ru

Class RussianLetterTokenizer


public class RussianLetterTokenizer
extends CharTokenizer

A RussianLetterTokenizer is a tokenizer that extends the behavior of LetterTokenizer by additionally looking up letters in a given "russian charset". The problem with LetterTokenizer is that it uses the Character.isLetter() method, which does not recognize letters in encodings such as CP1252 and KOI8 (the well-known problems with the 0xD7 and 0xF7 chars).
Version:
$Id: RussianLetterTokenizer.java 150998 2004-08-16 20:30:46Z dnaber $
Author:
Boris Okner, b.okner@rogers.com
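
The 0xD7 and 0xF7 problem mentioned above can be reproduced with the JDK alone. The sketch below assumes the bytes reach the tokenizer decoded one byte per char (as ISO-8859-1 does), which is one plausible way the problem arises; the class and method names are hypothetical. Under that assumption 0xD7 and 0xF7 become the multiplication and division signs, which Character.isLetter() rejects, although in Windows-1251, for instance, those byte values encode Cyrillic letters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
class IsLetterDemo {

    // Decodes one byte as ISO-8859-1, which maps byte value N to the char with code N.
    static char decodeLatin1(int byteValue) {
        byte[] raw = { (byte) byteValue };
        return new String(raw, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        // 0xD6 and 0xF6 are accented Latin letters; 0xD7 and 0xF7 are math signs.
        for (int b : new int[] { 0xD6, 0xD7, 0xF6, 0xF7 }) {
            char c = decodeLatin1(b);
            System.out.printf("0x%02X -> U+%04X isLetter=%b%n",
                    b, (int) c, Character.isLetter(c));
        }
    }
}
```

Listing those two chars in the charset passed to the constructor is what lets the tokenizer keep them inside a token.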

Field Summary

Fields inherited from class org.apache.lucene.analysis.Tokenizer

input

Constructor Summary

RussianLetterTokenizer(Reader in, char[] charset)

Method Summary

protected boolean
isTokenChar(char c)
Collects characters which satisfy Character.isLetter(char) or which are found in the given charset.

Methods inherited from class org.apache.lucene.analysis.CharTokenizer

next, normalize

Methods inherited from class org.apache.lucene.analysis.Tokenizer

close

Methods inherited from class org.apache.lucene.analysis.TokenStream

close, next

Constructor Details

RussianLetterTokenizer

public RussianLetterTokenizer(Reader in,
                              char[] charset)
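
The constructor takes the input Reader and the char[] of extra characters to treat as letters. As an illustration only, the sketch below builds one possible charset array from the Unicode Cyrillic range; the helper name and the choice of characters are assumptions, not something this page specifies. The constructor call itself appears only in a comment, since it needs Lucene on the classpath.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class, not part of Lucene.
class CharsetArgumentSketch {

    // Builds a char[] with the Cyrillic capital and small letters
    // U+0410 through U+044F, plus the letter io in both cases (U+0401, U+0451).
    static char[] buildUnicodeRussianCharset() {
        StringBuilder sb = new StringBuilder();
        for (char c = 0x0410; c <= 0x044F; c++) {
            sb.append(c);
        }
        sb.append((char) 0x0401).append((char) 0x0451);
        return sb.toString().toCharArray();
    }

    public static void main(String[] args) {
        char[] charset = buildUnicodeRussianCharset();
        Reader in = new StringReader("sample text");
        // With Lucene available, the two arguments would be passed as:
        //   new RussianLetterTokenizer(in, charset);
        System.out.println("charset size: " + charset.length);
    }
}
```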

Method Details

isTokenChar

protected boolean isTokenChar(char c)
Collects characters which satisfy Character.isLetter(char) or which are found in the given charset.
Overrides:
isTokenChar in class CharTokenizer
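
The class description suggests a predicate of the form "is a letter, or is listed in the charset". The following is a minimal standalone sketch of that idea, not the Lucene source: the class name, the linear scan over the charset, and the simplified tokenize loop (which splits text into maximal runs of token characters) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch, independent of Lucene.
class LetterOrCharsetSketch {
    private final char[] charset;

    LetterOrCharsetSketch(char[] charset) {
        this.charset = charset.clone();
    }

    // True if c is a Unicode letter or appears in the supplied charset.
    boolean isTokenChar(char c) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char allowed : charset) {
            if (allowed == c) {
                return true;
            }
        }
        return false;
    }

    // Splits text into maximal runs of characters accepted by isTokenChar.
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] charset = { (char) 0xD7, (char) 0xF7 };
        String text = "ab" + (char) 0xD7 + "cd 12 ef";
        // With the charset, 0xD7 no longer splits the first word.
        System.out.println(new LetterOrCharsetSketch(charset).tokenize(text));
        System.out.println(new LetterOrCharsetSketch(new char[0]).tokenize(text));
    }
}
```

With an empty charset the sketch falls back to plain Character.isLetter() behavior, so 0xD7 acts as a separator.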

Copyright © 2000-2007 Apache Software Foundation. All Rights Reserved.