
whoosh-igo 0.7

Tokenizers for Whoosh, designed for the Japanese language

About

Tokenizers for the Whoosh full-text search library, designed for the Japanese language. This package contains three tokenizers:

  • IgoTokenizer
  • TinySegmenterTokenizer
  • MeCabTokenizer

How To Use

IgoTokenizer:

import igo.Tagger
import whooshjp
from whoosh.fields import Schema, TEXT, ID
from whooshjp.IgoTokenizer import IgoTokenizer

tk = IgoTokenizer(igo.Tagger.Tagger('ipadic'))
scm = Schema(title=TEXT(stored=True, analyzer=tk), path=ID(unique=True, stored=True), content=TEXT(analyzer=tk))

TinySegmenterTokenizer:

import tinysegmenter
import whooshjp
from whoosh.fields import Schema, TEXT, ID
from whooshjp.TinySegmenterTokenizer import TinySegmenterTokenizer

tk = TinySegmenterTokenizer(tinysegmenter.TinySegmenter())
scm = Schema(title=TEXT(stored=True, analyzer=tk), path=ID(unique=True, stored=True), content=TEXT(analyzer=tk))

Changelog for Japanese Tokenizers for Whoosh

2011-02-19 – 0.1
  • first release.
2011-02-21 – 0.2
  • add TinySegmenterTokenizer
  • change module name
2011-02-24 – 0.3
  • add FeatureFilter
2011-02-27 – 0.4
  • add MeCabTokenizer
  • add a mode that avoids pickling the Igo tagger, to minimize index size
2011-04-17 – 0.5
  • correct char offsets
2011-04-17 – 0.6
  • correct char offsets (TinySegmenterTokenizer)
2012-04-14 – 0.7
  • rename package (WhooshJapaneseTokenizer to whooshjp)
  • no longer import submodules automatically
  • Python 3 compatibility (3.2, 3.3)
  • drop Python 2.5 support

File                   Type    Uploaded on  Size
whoosh-igo-0.7.tar.gz  Source  2012-07-16   7KB