whooshjp 0.7
About
Tokenizers for the Whoosh full-text search library, designed for the Japanese language.
This package contains three tokenizers:
IgoTokenizer
requires igo-python (http://pypi.python.org/pypi/igo-python/) and its dictionary.
TinySegmenterTokenizer
requires the Python port of TinySegmenter (https://code.google.com/p/mhagiwara/source/browse/trunk/nltk/jpbook/tinysegmenter.py).
MeCabTokenizer
requires the MeCab Python binding (http://mecab.sourceforge.net/bindings.html).
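All three tokenizers follow the same pattern: an external segmenter is wrapped as a callable analyzer that yields tokens annotated with a position and character offsets, which is what Whoosh expects. A simplified, self-contained sketch of that pattern (the NaiveSegmenter here is a hypothetical whitespace-splitting stand-in; the real packages delegate to igo, TinySegmenter, or MeCab for morphological analysis):

```python
class NaiveSegmenter:
    """Toy stand-in segmenter: splits on whitespace.
    The real segmenters perform Japanese morphological analysis."""
    def tokenize(self, text):
        return text.split()


class SegmenterTokenizer:
    """Wraps a segmenter as a tokenizer yielding (term, position, start, end)."""
    def __init__(self, segmenter):
        self.segmenter = segmenter

    def __call__(self, text):
        pos = 0      # token position in the stream
        offset = 0   # running character offset into the original text
        for word in self.segmenter.tokenize(text):
            # locate the word in the source so char offsets stay aligned
            start = text.index(word, offset)
            yield (word, pos, start, start + len(word))
            pos += 1
            offset = start + len(word)


tk = SegmenterTokenizer(NaiveSegmenter())
print(list(tk("full text search")))
# → [('full', 0, 0, 4), ('text', 1, 5, 9), ('search', 2, 10, 16)]
```

Keeping the character offsets correct matters for highlighting; the 0.5 and 0.6 releases below fixed exactly that.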
How To Use
IgoTokenizer:
import igo.Tagger
import whooshjp
from whooshjp.IgoTokenizer import IgoTokenizer
from whoosh.fields import Schema, TEXT, ID

tk = IgoTokenizer(igo.Tagger.Tagger('ipadic'))
scm = Schema(title=TEXT(stored=True, analyzer=tk), path=ID(unique=True, stored=True), content=TEXT(analyzer=tk))
TinySegmenterTokenizer:
import tinysegmenter
import whooshjp
from whooshjp.TinySegmenterTokenizer import TinySegmenterTokenizer
from whoosh.fields import Schema, TEXT, ID

tk = TinySegmenterTokenizer(tinysegmenter.TinySegmenter())
scm = Schema(title=TEXT(stored=True, analyzer=tk), path=ID(unique=True, stored=True), content=TEXT(analyzer=tk))
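MeCabTokenizer usage is not shown in the examples above; by analogy with the other two tokenizers it should look roughly like this. This is a sketch, not documented usage: the MeCabTokenizer constructor argument is an assumption based on the pattern of IgoTokenizer and TinySegmenterTokenizer.

```python
import MeCab
import whooshjp
from whooshjp.MeCabTokenizer import MeCabTokenizer
from whoosh.fields import Schema, TEXT, ID

# Assumed constructor signature, mirroring the other tokenizers:
# pass a configured MeCab tagger to the tokenizer.
tk = MeCabTokenizer(MeCab.Tagger())
scm = Schema(title=TEXT(stored=True, analyzer=tk), path=ID(unique=True, stored=True), content=TEXT(analyzer=tk))
```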
Changelog for Japanese Tokenizers for Whoosh
2011-02-19 – 0.1
first release.
2011-02-21 – 0.2
add TinySegmenterTokenizer
change module name
2011-02-24 – 0.3
add FeatureFilter
2011-02-27 – 0.4
add MeCabTokenizer
add a mode that does not pickle the igo tagger, to keep the index small
2011-04-17 – 0.5
correct char offsets
2011-04-17 – 0.6
correct char offsets (TinySegmenterTokenizer)
2012-04-14 – 0.7
rename package (WhooshJapaneseTokenizer to whooshjp)
no longer import submodules automatically
Python 3 compatibility (3.2, 3.3)
drop Python 2.5 support