mnoGoSearch 3.2.18 reference manual: Full-featured search engine software
Chapter 7. Languages support
Chinese, Thai and Japanese writing traditionally has no spaces between the words of a phrase, unlike western languages. When indexing documents in these languages, phrases therefore need to be additionally segmented into words.
For Japanese phrase segmentation, either ChaSen, a morphological analysis system for the Japanese language, or MeCab, a Japanese morphological analyzer, is used. One of these systems must therefore be installed before configuring and building mnoGoSearch.
To enable Japanese phrase segmentation, pass the --enable-chasen or --enable-mecab switch to configure.
For Chinese phrase segmentation, a frequency dictionary of Chinese words is used. The segmentation itself is done by a dynamic programming method that maximizes the cumulative frequency of the produced words.
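The following Python sketch illustrates the general idea of frequency-based segmentation by dynamic programming. It is not mnoGoSearch's actual implementation: the function name `segment`, the `max_word_len` limit, the zero score for unknown single characters, and the sample frequencies are all assumptions made for illustration.

```python
from typing import Dict, List


def segment(phrase: str, freq: Dict[str, int], max_word_len: int = 4) -> List[str]:
    """Split phrase into words so that the sum of word frequencies is maximal.

    freq maps known words to their frequencies (hypothetical data here).
    Unknown single characters are accepted with score 0 so that some
    segmentation always exists (an assumption of this sketch).
    """
    n = len(phrase)
    # best[i] is the highest cumulative frequency for segmenting phrase[:i].
    best = [float("-inf")] * (n + 1)
    # back[i] is the start index of the last word in that best segmentation.
    back = [0] * (n + 1)
    best[0] = 0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = phrase[start:end]
            if word in freq:
                score = freq[word]
            elif end - start == 1:
                score = 0  # unknown single character
            else:
                continue  # unknown multi-character string: not a candidate
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start

    # Walk the back-pointers from the end of the phrase to recover the words.
    words: List[str] = []
    i = n
    while i > 0:
        words.append(phrase[back[i]:i])
        i = back[i]
    words.reverse()
    return words


# Hypothetical frequencies, chosen only to demonstrate the algorithm.
sample_freq = {"北京": 50, "大学": 40, "北京大学": 30,
               "北": 5, "京": 3, "大": 8, "学": 6}
print(segment("北京大学", sample_freq))  # ['北京', '大学']
```

With these sample frequencies the split into two words scores 90, which beats both the single four-character word (30) and the character-by-character split (22).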
To enable Chinese phrase segmentation, enable the appropriate charset support when configuring mnoGoSearch: GB2312 if mandarin.freq, a simplified Chinese dictionary, will be used, or Big5 if TraditionalChinese.freq, a traditional Chinese dictionary, will be used. Then specify the frequency dictionary of Chinese words with LoadChineseList in the indexer.conf file.
LoadChineseList [charset dictionaryfilename]
By default, the GB2312 charset and the mandarin.freq dictionary are used.
For Thai phrase segmentation, a frequency dictionary of Thai words is used. The segmentation itself is done in the same way as for Chinese.
To enable Thai phrase segmentation, specify the frequency dictionary of Thai words with LoadThaiList in the indexer.conf file.
LoadThaiList [charset dictionaryfilename]
By default, the tis-620 charset and the thai.freq dictionary are used.