NepomukDaemons
Nepomuk::CLuceneTokenizer Class Reference
#include <clucenetokenizer.h>

Detailed Description
A grammar-based tokenizer constructed with JavaCC. This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
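The three rules above can be illustrated with a small standalone sketch. It is not the Nepomuk or CLucene implementation: `sketchTokenize` is a hypothetical helper that works on ASCII `std::string` input, splits on whitespace first, and treats any chunk containing `@` as a single address-like token. It only approximates the documented behaviour.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of Nepomuk or CLucene.
static bool isAlnumChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

std::vector<std::string> sketchTokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    std::string chunk;
    while (in >> chunk) {
        // Strip leading and trailing punctuation, e.g. "site." -> "site".
        // A trailing dot is followed by whitespace, so it is not kept.
        std::size_t begin = 0, end = chunk.size();
        while (begin < end && !isAlnumChar(chunk[begin])) ++begin;
        while (end > begin && !isAlnumChar(chunk[end - 1])) --end;
        chunk = chunk.substr(begin, end - begin);
        if (chunk.empty()) continue;
        // Invariant after trimming: the chunk starts and ends alphanumeric.
        assert(isAlnumChar(chunk.front()) && isAlnumChar(chunk.back()));

        // Simplification: any chunk containing '@' is kept whole.
        if (chunk.find('@') != std::string::npos) {
            tokens.push_back(chunk);
            continue;
        }

        // A digit anywhere in the chunk means hyphens do not split it.
        bool hasDigit = false;
        for (char c : chunk) {
            if (std::isdigit(static_cast<unsigned char>(c))) hasDigit = true;
        }

        std::string current;
        for (char c : chunk) {
            // Inner dots are never followed by whitespace here, so they stay.
            bool keep = isAlnumChar(c) || c == '.' || (c == '-' && hasDigit);
            if (keep) {
                current += c;
            } else if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        }
        if (!current.empty()) tokens.push_back(current);
    }
    return tokens;
}
```

Under these assumptions, `well-known` is split into two tokens, while `XR-500`, `www.kde.org` and `bob@example.org` each stay whole.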
Definition at line 54 of file clucenetokenizer.h.
Public Member Functions
CL_NS(util) CLuceneTokenizer (CL_NS(util)::Reader *reader)
bool next (CL_NS(analysis)::Token *token)
bool ReadAlphaNum (const TCHAR prev, CL_NS(analysis)::Token *t)
bool ReadApostrophe (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool ReadAt (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool ReadCJK (const TCHAR prev, CL_NS(analysis)::Token *t)
bool ReadCompany (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool ReadNumber (const TCHAR *previousNumber, const TCHAR prev, CL_NS(analysis)::Token *t)
~CLuceneTokenizer ()
Constructor & Destructor Documentation
CL_NS(util) Nepomuk::CLuceneTokenizer::CLuceneTokenizer (CL_NS(util)::Reader *reader)
Nepomuk::CLuceneTokenizer::~CLuceneTokenizer ()
Definition at line 132 of file clucenetokenizer.cpp.
Member Function Documentation
bool Nepomuk::CLuceneTokenizer::next (CL_NS(analysis)::Token *token)
Stores the next token in the stream in token and returns true, or returns false at end-of-stream.
The returned token's type is set to an element of CLuceneTokenizerConstants::tokenImage.
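The calling pattern implied by this signature is a loop that runs until next() returns false. The sketch below shows that pattern with stand-in types so that it compiles without CLucene: `SketchToken`, `SketchTokenStream` and `drain` are hypothetical names, and the type labels are placeholders rather than actual entries of CLuceneTokenizerConstants::tokenImage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for CL_NS(analysis)::Token; NOT the CLucene type.
struct SketchToken {
    std::string text;
    std::string type;  // placeholder label, not a real tokenImage entry
};

// Stand-in stream that mirrors the documented next() contract.
class SketchTokenStream {
public:
    explicit SketchTokenStream(std::vector<SketchToken> tokens)
        : tokens_(std::move(tokens)) {}

    // Fills *token and returns true, or returns false at end-of-stream.
    bool next(SketchToken* token) {
        assert(token != nullptr);
        if (pos_ >= tokens_.size()) return false;
        *token = tokens_[pos_++];
        return true;
    }

private:
    std::vector<SketchToken> tokens_;
    std::size_t pos_ = 0;
};

// Consumes the whole stream, reusing one token object for every call.
std::vector<std::string> drain(SketchTokenStream& stream) {
    std::vector<std::string> out;
    SketchToken token;
    while (stream.next(&token)) {
        out.push_back(token.type + ":" + token.text);
    }
    return out;
}
```

Reusing a single token object across calls matches the pointer-based signature, where the caller owns the token and the stream only fills it in.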
bool Nepomuk::CLuceneTokenizer::ReadAlphaNum (const TCHAR prev, CL_NS(analysis)::Token *t)
bool Nepomuk::CLuceneTokenizer::ReadApostrophe (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool Nepomuk::CLuceneTokenizer::ReadAt (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool Nepomuk::CLuceneTokenizer::ReadCJK (const TCHAR prev, CL_NS(analysis)::Token *t)
bool Nepomuk::CLuceneTokenizer::ReadCompany (CL_NS(util)::StringBuffer *str, CL_NS(analysis)::Token *t)
bool Nepomuk::CLuceneTokenizer::ReadNumber (const TCHAR *previousNumber, const TCHAR prev, CL_NS(analysis)::Token *t)
The documentation for this class was generated from the following files: