JapaneseTokenizer

Interface package for Japanese tokenization

What’s this?

This is a simple wrapper for Japanese tokenizers (a.k.a. morphological splitters).

This package aims to let you call a tokenizer and split a sentence into tokens in one line.

If you find any bugs, please report them via GitHub issues. Pull requests are welcome too!

Requirements

Python 2.7
Python 3.5

Features

You can get a set of tokens from an input sentence.
You can filter tokens by Part-of-Speech conditions or stopwords.
You can add an extension dictionary such as the mecab-neologd dictionary.
You can define your own dictionary, which forces MeCab to treat each of its entries as one token.

Setting up

MeCab system

See here to install the MeCab system.

Mecab Neologd dictionary

The mecab-neologd dictionary is a dictionary extension based on the ipadic dictionary, which is the basic dictionary of MeCab.

With the mecab-neologd dictionary, newly coined words are kept as a single token.

Here, newly coined words are things such as movie actor names or company names.

See https://github.com/neologd/mecab-ipadic-neologd to install the mecab-neologd dictionary.

Install

[sudo] python setup.py install

Usage

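Tokenization example (for Python 2.x; to see example code for Python 3.x, please see here)

# NOTE: this import path is an assumption; adjust it to match your installed version
from JapaneseTokenizer import MecabWrapper

# the input must be of `unicode` type (in Python 2.x)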
sentence = u'テヘラン（ペルシア語: تهران  ; Tehrān Tehran.ogg 発音[ヘルプ/ファイル]/teɦˈrɔːn/、英語:Tehran）は、西アジア、イランの首都でありかつテヘラン州の州都。人口12,223,598人。都市圏人口は13,413,348人に達する。'

# Create a MecabWrapper object.
# `path_mecab_config` is the directory where the `mecab-config` command lives.
# You can check it with `which mecab-config`; the default value is '/usr/local/bin'.
path_mecab_config='/usr/local/bin'

# You can choose from "neologd", "all", "ipadic", "user", ""
# "ipadic" and "" are equivalent
dictType = ""

mecab_wrapper = MecabWrapper(dictType=dictType, path_mecab_config=path_mecab_config)

# Tokenize the sentence. The returned object is a list of tuples.
tokenized_obj = mecab_wrapper.tokenize(sentence=sentence)
assert isinstance(tokenized_obj, list)

# The returned object is a "TokenizedSenetence" instance if you pass return_list=False
tokenized_obj = mecab_wrapper.tokenize(sentence=sentence, return_list=False)
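
The `path_mecab_config` value used above can also be looked up from Python instead of running `which mecab-config` by hand. The following is only a sketch for Python 3.3 or later (where `shutil.which` is available); the helper name `find_command_dir` is hypothetical and is not part of this package.

```python
import os
import shutil


def find_command_dir(command, default="/usr/local/bin"):
    """Return the directory holding `command`, or `default` if it is not on PATH.

    Hypothetical helper for illustration; not part of JapaneseTokenizer.
    """
    found = shutil.which(command)
    if found is None:
        # The command is not on PATH, so fall back to the documented default.
        return default
    return os.path.dirname(found)


# Directory that could be passed as `path_mecab_config`.
path_mecab_config = find_command_dir("mecab-config")
print(path_mecab_config)
```

If `mecab-config` is not on your PATH, the helper simply returns the default directory, so you may still need to set the value manually.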

Filtering example

stopwords = [u'テヘラン']
assert isinstance(tokenized_obj, TokenizedSenetence)
# the returned object is a "FilteredObject" instance
filtered_obj = mecab_wrapper.filter(
    parsed_sentence=tokenized_obj,
    stopwords=stopwords
)
assert isinstance(filtered_obj, FilteredObject)

# The POS condition is a list of tuples.
# POS values follow the "ChaSen 品詞体系 (IPA品詞体系)" (ChaSen / IPA part-of-speech system) section of http://www.unixuser.org/~euske/doc/postag/#chasen
pos_condition = [(u'名詞', u'固有名詞'), (u'動詞', u'自立')]
filtered_obj = mecab_wrapper.filter(
    parsed_sentence=tokenized_obj,
    pos_condition=pos_condition
)
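
To make the two filtering modes above more concrete, here is a minimal standalone sketch of stopword and POS-condition filtering over plain tuples. It does not use this package at all: the `Token` shape, the `matches_pos` and `filter_tokens` helpers, the sample POS tags, and the prefix-matching rule are all assumptions for illustration, and the package's own matching logic may differ.

```python
# -*- coding: utf-8 -*-
from collections import namedtuple

# Hypothetical token shape for illustration only; the package's own
# token objects may look different.
Token = namedtuple("Token", ["surface", "pos"])


def matches_pos(token_pos, pos_condition):
    """Return True if token_pos starts with any tuple in pos_condition."""
    return any(token_pos[:len(cond)] == cond for cond in pos_condition)


def filter_tokens(tokens, stopwords=(), pos_condition=None):
    """Drop stopwords, then keep only tokens matching pos_condition (if given)."""
    kept = []
    for token in tokens:
        if token.surface in stopwords:
            continue
        if pos_condition is not None and not matches_pos(token.pos, pos_condition):
            continue
        kept.append(token)
    return kept


# Hand-written sample tokens; the POS tags are assumed, not produced by MeCab.
tokens = [
    Token(u"テヘラン", (u"名詞", u"固有名詞", u"地域")),
    Token(u"は", (u"助詞", u"係助詞")),
    Token(u"イラン", (u"名詞", u"固有名詞", u"地域")),
    Token(u"首都", (u"名詞", u"一般")),
]

result = filter_tokens(
    tokens,
    stopwords=[u"テヘラン"],
    pos_condition=[(u"名詞", u"固有名詞"), (u"動詞", u"自立")],
)
print([t.surface for t in result])
```

In this sketch a condition such as `(u'名詞', u'固有名詞')` matches any token whose POS tuple begins with those two values, so only the proper noun that is not a stopword survives.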

Similar Package

natto-py

natto-py is a sophisticated package for tokenization. It supports the following features:

easy interface for tokenization
importing additional dictionary
partial parsing mode

CHANGES

0.6(2016-03-05)

First release to PyPI

Release history

1.6 (Mar 25, 2019)
1.5 (Jan 21, 2019)
1.4 (Dec 24, 2018)
1.3.7 (Feb 27, 2018)
1.3.6 (Nov 1, 2017)
1.3.5 (Sep 27, 2017)
1.3.4 (Sep 21, 2017)
1.3.3 (Sep 11, 2017)
1.3.1 (Jun 29, 2017)
1.3.0 (Feb 23, 2017)
1.2.7 (Jan 13, 2017)
1.2.6 (Jan 11, 2017)
1.2.5 (Dec 28, 2016)
1.2.3 (Dec 8, 2016)
1.0 (Aug 3, 2016)
1.0b0 pre-release (Jun 22, 2016)
1.0a0 pre-release (Jun 19, 2016)
0.9 (Apr 4, 2016)
0.8 (Apr 2, 2016)
0.7 (Mar 6, 2016)
0.6a1 pre-release (Mar 5, 2016), this version

Download files

Source Distribution

JapaneseTokenizer-0.6a1.tar.gz (10.6 kB)

Uploaded Mar 5, 2016 Source

Hashes for JapaneseTokenizer-0.6a1.tar.gz
Algorithm	Hash digest
SHA256	`c15a23d01f1ad997049e1f89333bb421b98fd5f0a2fbe7ae005662a9cd5a383a`
MD5	`e55f95650df77ef126ee23b2cc88ad20`
BLAKE2b-256	`5b96cc92357c7e7261c791db6f769bda1cd2c1f4547ea64bd1b75227c2818643`