Tokenizers for Whoosh designed for the Japanese language
About
Tokenizers for the Whoosh full-text search library, designed for the Japanese language. This package contains three tokenizers.
IgoTokenizer
requires igo-python (http://pypi.python.org/pypi/igo-python/) and its dictionary.
TinySegmenterTokenizer
requires the Python port of TinySegmenter (https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py)
MeCabTokenizer
requires the MeCab Python binding (http://mecab.sourceforge.net/bindings.html)
How To Use
IgoTokenizer:
from whoosh.fields import Schema, TEXT, ID
import igo.Tagger
from whooshjp.IgoTokenizer import IgoTokenizer

tk = IgoTokenizer(igo.Tagger.Tagger('ipadic'))
scm = Schema(title=TEXT(stored=True, analyzer=tk),
             path=ID(unique=True, stored=True),
             content=TEXT(analyzer=tk))
TinySegmenterTokenizer:
from whoosh.fields import Schema, TEXT, ID
import tinysegmenter
from whooshjp.TinySegmenterTokenizer import TinySegmenterTokenizer

tk = TinySegmenterTokenizer(tinysegmenter.TinySegmenter())
scm = Schema(title=TEXT(stored=True, analyzer=tk),
             path=ID(unique=True, stored=True),
             content=TEXT(analyzer=tk))
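Each tokenizer wraps a segmenter and exposes the same Whoosh tokenizer contract: a callable that yields token objects carrying the surface text plus its position and character offsets into the original string (the 0.5 and 0.6 releases fixed those offsets). The sketch below illustrates that contract in pure Python; the Token dataclass and WhitespaceSegmenter are illustrative stand-ins, not the real whoosh.analysis.Token or a Japanese segmenter:

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Token:
    # Stand-in for whoosh.analysis.Token: surface form, token position,
    # and character offsets into the original text.
    text: str
    pos: int
    startchar: int
    endchar: int


class SegmenterTokenizer:
    """Wraps any segmenter exposing .segment(text) -> list of strings."""

    def __init__(self, segmenter):
        self.segmenter = segmenter

    def __call__(self, text: str) -> Iterator[Token]:
        offset = 0
        for pos, word in enumerate(self.segmenter.segment(text)):
            # Locate each segment in the source text so offsets stay
            # correct even when the segmenter drops whitespace.
            start = text.index(word, offset)
            end = start + len(word)
            offset = end
            yield Token(word, pos, start, end)


class WhitespaceSegmenter:
    # Trivial segmenter for illustration; a real tokenizer would call
    # igo, TinySegmenter, or MeCab here.
    def segment(self, text: str) -> List[str]:
        return text.split()


tk = SegmenterTokenizer(WhitespaceSegmenter())
tokens = list(tk("full text search"))
```

Because highlighting in Whoosh relies on startchar/endchar to map tokens back into the stored document, keeping these offsets exact is what the offset-correction releases were about.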
Changelog for Japanese Tokenizers for Whoosh
- 2011-02-19 – 0.1
first release.
- 2011-02-21 – 0.2
add TinySegmenterTokenizer
change module name
- 2011-02-24 – 0.3
add FeatureFilter
- 2011-02-27 – 0.4
add MeCabTokenizer
add a mode that avoids pickling the igo tagger, to minimize index size
- 2011-04-17 – 0.5
correct char offsets
- 2011-04-17 – 0.6
correct char offsets (TinySegmenterTokenizer)
- 2012-04-14 – 0.7
rename package (WhooshJapaneseTokenizer to whooshjp)
no longer import sub modules automatically
Python 3 compatibility (3.2, 3.3)
Drop Python 2.5 support