JapaneseTokenizer

Project description

What’s this?

This is a simple wrapper for Japanese tokenizers (a.k.a. morphological analyzers).

This project aims to make calling a tokenizer and splitting a sentence into tokens as easy as possible.

It supports several tokenization tools, so you can compare results among them.

This project is also available on GitHub.

If you find any bugs, please report them on GitHub issues. Pull requests are also welcome!

Requirements

Python 2.7
Python 3.5

Features

You can get a set of tokens from an input sentence
You can filter tokens by part-of-speech condition or stopwords
You can add an extension dictionary such as the mecab-neologd dictionary
You can define your own dictionary, which forces Mecab to treat each entry as one token

Supported Tokenization tools

Mecab

Mecab is an open source tokenizer system for various languages (provided you have a dictionary for the language).

See the English documentation for details.

Juman

Juman is a tokenizer system developed by Kurohashi laboratory, Kyoto University, Japan.

Juman is robust to ambiguous writing styles in Japanese and handles newly coined words well thanks to its huge Web-based dictionary.

In addition, Juman gives you the semantic meaning of words.

Juman++

Juman++ is a tokenizer system developed by Kurohashi laboratory, Kyoto University, Japan.

Juman++ is the successor to Juman. It adopts an RNN model for tokenization.

Juman++ is robust to ambiguous writing styles in Japanese and handles newly coined words well thanks to its huge Web-based dictionary.

Like Juman, Juman++ gives you the semantic meaning of words.

Kytea

Kytea is a tokenizer tool developed by Graham Neubig.

Kytea uses a different algorithm from that of Mecab or Juman.

Setting up

Tokenizers auto-install

make install

mecab-neologd dictionary auto-install

make install_neologd

Tokenizers manual-install

MeCab

See here to install the MeCab system.

Mecab Neologd dictionary

The Mecab-neologd dictionary is a dictionary extension based on ipadic, the basic dictionary of Mecab.

With the Mecab-neologd dictionary, newly coined words are parsed as a single token.

Here, newly coined words are things like movie actor names or company names.

See here to install the mecab-neologd dictionary.

Juman

wget -O juman7.0.1.tar.bz2 "http://nlp.ist.i.kyoto-u.ac.jp/DLcounter/lime.cgi?down=http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/juman/juman-7.01.tar.bz2&name=juman-7.01.tar.bz2"
bzip2 -dc juman7.0.1.tar.bz2  | tar xvf -
cd juman-7.01
./configure
make
[sudo] make install

Juman++

GCC version must be >= 5

wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.02.tar.xz
tar xJvf jumanpp-1.02.tar.xz
cd jumanpp-1.02/
./configure
make
[sudo] make install

Kytea

Install Kytea system

wget http://www.phontron.com/kytea/download/kytea-0.4.7.tar.gz
tar -xzvf kytea-0.4.7.tar.gz
cd kytea-0.4.7
./configure
make
make install

Kytea has a Python wrapper thanks to michiaki ariga. Install the Kytea Python wrapper:

pip install kytea

Install JapaneseTokenizer

[sudo] python setup.py install

Note

During installation, you will see a warning message if pyknp or kytea fails to install.

If you see these messages, try to reinstall these packages manually.
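Before reinstalling, it can help to confirm that the tokenizer commands themselves are on your PATH. The following is a minimal sketch for Python 3 using only the standard library. The command names (mecab, juman, jumanpp, kytea) are assumed to be the executable names on your system, and the helper find_tokenizer_commands is hypothetical, not part of JapaneseTokenizer.

```python
import shutil

# Assumed executable names of each tokenizer; adjust them for your system.
TOKENIZER_COMMANDS = ('mecab', 'juman', 'jumanpp', 'kytea')


def find_tokenizer_commands(commands=TOKENIZER_COMMANDS):
    """Map each command name to its full path, or None when it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == '__main__':
    for name, path in find_tokenizer_commands().items():
        print('{:8s} {}'.format(name, path or 'NOT FOUND'))
```

A command reported as NOT FOUND suggests that the tokenizer itself needs to be installed (or added to PATH) before its Python wrapper can work.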

Usage

Tokenization example (for Python 3.x; for Python 2.x example code, please see here)

import JapaneseTokenizer
input_sentence = '10日放送の「中居正広のミになる図書館」（テレビ朝日系）で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
# ipadic is a well-maintained dictionary
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
print(mecab_wrapper.tokenize(input_sentence).convert_list_object())

# neologd is a dictionary generated automatically from a huge web corpus
mecab_neologd_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
print(mecab_neologd_wrapper.tokenize(input_sentence).convert_list_object())
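Since one goal of this package is comparing results among tokenizers and dictionaries, here is a rough sketch of how two token lists might be compared. It assumes the results are plain lists of token strings. The helper compare_token_lists is hypothetical, and the two sample lists are hand-written for illustration, not real tokenizer output.

```python
def compare_token_lists(tokens_a, tokens_b):
    """Return tokens shared by both lists and tokens unique to each one."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return {
        'common': sorted(set_a & set_b),
        'only_a': sorted(set_a - set_b),
        'only_b': sorted(set_b - set_a),
    }


# Hand-written token lists for illustration only; NOT real tokenizer output.
ipadic_like = ['中居', '正広', 'の', '図書館']
neologd_like = ['中居正広', 'の', '図書館']

diff = compare_token_lists(ipadic_like, neologd_like)
print(diff['only_a'])  # tokens that appear only in the first list
print(diff['only_b'])  # tokens that appear only in the second list
```

A set-based comparison ignores token order and repetition, which is usually enough to spot where two dictionaries split a name differently.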

Filtering example

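import JapaneseTokenizer
# Same sentence and ipadic wrapper as in the tokenization example
input_sentence = '10日放送の「中居正広のミになる図書館」（テレビ朝日系）で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
# Filter tokens by stopword and part-of-speech condition
print(mecab_wrapper.tokenize(input_sentence).filter(stopwords=['テレビ朝日'], pos_condition=[('名詞', '固有名詞')]).convert_list_object())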
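To make the effect of stopwords and pos_condition more concrete, here is a standalone sketch of this kind of filtering. It is not the library's actual implementation: the function filter_tokens, the (surface, POS tuple) token shape, and the prefix-matching rule are all assumptions made for illustration, and the sample tokens are hand-written rather than real tokenizer output.

```python
def filter_tokens(tokens, stopwords=(), pos_condition=()):
    """Keep surfaces that are not stopwords and whose POS matches a condition.

    tokens: list of (surface, pos_tuple) pairs.
    pos_condition: POS tuples such as [('名詞', '固有名詞')]; a token matches when
    its POS tuple starts with one of them. An empty value disables the POS check.
    """
    stopword_set = set(stopwords)
    kept = []
    for surface, pos in tokens:
        if surface in stopword_set:
            continue
        if pos_condition and not any(pos[:len(c)] == tuple(c) for c in pos_condition):
            continue
        kept.append(surface)
    return kept


# Hand-written (surface, POS) pairs for illustration; NOT real tokenizer output.
sample = [
    ('テレビ朝日', ('名詞', '固有名詞', '組織')),
    ('SMAP', ('名詞', '固有名詞', '組織')),
    ('放送', ('名詞', 'サ変接続', '*')),
    ('の', ('助詞', '連体化', '*')),
]
print(filter_tokens(sample, stopwords=['テレビ朝日'], pos_condition=[('名詞', '固有名詞')]))
```

In this sketch the stopword removes テレビ朝日 even though its part-of-speech matches, and the POS condition removes everything that is not a proper noun.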

Part-of-speech structure

Mecab and Juman use different part-of-speech (POS) systems.

Keep this in mind when you write POS conditions.

You can check the part-of-speech (POS) tables here.

Similar Package

natto-py

natto-py is a sophisticated package for tokenization. It supports the following features:

easy interface for tokenization
importing additional dictionary
partial parsing mode

CHANGES

0.6 (2016-03-05)

First release to PyPI

0.7 (2016-03-06)

Added Juman support (Python 2.x only)
Added Kytea support (Python 2.x only)

0.8 (2016-04-03)

Fixed a bug that occurred when the interface calls JUMAN
Fixed the version number of jctconv

0.9 (2016-04-05)

Added Kytea support for Python 3.x (thanks to @chezou)

1.0 (2016-06-19)

Added Juman support for Python 3.x

1.2.5 (2016-12-28)

It fixed bugs in Juman server mode in Python 3.x
It supports Juman++
It supports the filter method with chained expressions

1.2.6 (2017-01-12)

It introduced a parameter for the text normalization function
- All \n characters are converted into 。, because a \n in the input text causes tokenization errors, especially in server mode.
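As a rough illustration of this normalization step, the sketch below replaces line breaks with the Japanese full stop. The helper replace_newlines is hypothetical and written for Python 3; it is not the library's own function or parameter name.

```python
def replace_newlines(text, replacement='。'):
    """Replace every newline character in text with the replacement string."""
    return text.replace('\n', replacement)


print(replace_newlines('今日は晴れ\n明日は雨'))
```

Replacing the newline with a sentence delimiter keeps the input on a single line while still marking where one sentence ends.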

1.2.8 (2017-02-22)

It has a Makefile for installing the tokenizers.
It is tested with Travis.

1.3.0 (2017-02-23)

It introduced a de-normalization function after the tokenization process (全角英数 -> 半角英数, i.e. full-width alphanumerics to half-width alphanumerics)
It detects the path to mecab-config automatically
It fixed a bug in initializing the juman object in Python 2
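The full-width to half-width conversion mentioned in the first item can be sketched as follows. This is an illustration for Python 3 built on str.translate, not necessarily how the library performs the conversion, and the function name zenkaku_alnum_to_hankaku is hypothetical.

```python
def zenkaku_alnum_to_hankaku(text):
    """Convert full-width ASCII letters and digits to their half-width forms."""
    table = {}
    # Code point ranges: full-width digits, uppercase letters, lowercase letters.
    for start, end in ((0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)):
        for code in range(start, end + 1):
            # Full-width forms sit exactly 0xFEE0 above their ASCII counterparts.
            table[code] = code - 0xFEE0
    return text.translate(table)


print(zenkaku_alnum_to_hankaku('ＳＭＡＰ１０日'))
```

Only letters and digits are mapped here, so kana, kanji, and full-width punctuation pass through unchanged.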

After 1.3.0

Change logs are in the GitHub releases.

LICENSE

MIT license

Release history

1.6: Mar 25, 2019
1.5: Jan 21, 2019
1.4: Dec 24, 2018
1.3.7: Feb 27, 2018
1.3.6: Nov 1, 2017
1.3.5: Sep 27, 2017
1.3.4: Sep 21, 2017
1.3.3: Sep 11, 2017 (this version)
1.3.1: Jun 29, 2017
1.3.0: Feb 23, 2017
1.2.7: Jan 13, 2017
1.2.6: Jan 11, 2017
1.2.5: Dec 28, 2016
1.2.3: Dec 8, 2016
1.0: Aug 3, 2016
1.0b0 (pre-release): Jun 22, 2016
1.0a0 (pre-release): Jun 19, 2016
0.9: Apr 4, 2016
0.8: Apr 2, 2016
0.7: Mar 6, 2016
0.6a1 (pre-release): Mar 5, 2016

Download files

Source Distribution

JapaneseTokenizer-1.3.3.tar.gz (29.0 kB), uploaded Sep 11, 2017

Hashes for JapaneseTokenizer-1.3.3.tar.gz

Algorithm	Hash digest
SHA256	`f18bdbb1883a02d2cacfa42cf41b3d5198a804b986c06ec5f61b37c4d0ca0a82`
MD5	`3c19ae9d41dc190c59be3664e3c9b659`
BLAKE2b-256	`b41c9939737367a9fbd76d83fb3785676f1c25f441e73612e2b8ef7cd0f96ca4`
