
A Japanese tokenizer based on recurrent neural networks


Nagisa is a Python module for Japanese word segmentation and POS-tagging. It is designed to be a simple and easy-to-use tool.

This tool has the following features.

  • Based on recurrent neural networks.
  • The word segmentation model uses character- and word-level features [池田+].
  • The POS-tagging model uses tag dictionary information [Inoue+].

For more details refer to the following links.

  • The stop words for nagisa are available here.
  • The presentation slide at PyCon JP (2022) is available here.
  • The article in Japanese is available here.
  • The documentation is available here.

Installation

Python 3.6 through 3.12 on Linux, or Python 3.6 through 3.11 on macOS (Intel), is required. This tool uses DyNet (the Dynamic Neural Network Toolkit) for neural network computation. You can install nagisa with the following command.

pip install nagisa

Windows users should run it with Python 3.6, 3.7, or 3.8 (64-bit). It is also compatible with the Windows Subsystem for Linux (WSL).
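As a quick pre-install check, the supported ranges stated above can be encoded in a small script. This is only a sketch: the `SUPPORTED_MINORS` table and the `is_supported` helper are hypothetical names introduced here, not part of nagisa, and the "Darwin" entry assumes an Intel Mac, which the check itself does not verify.

```python
import platform
import sys

# Inclusive Python 3 minor-version ranges, copied from the installation notes.
# Keys are the values returned by platform.system().
SUPPORTED_MINORS = {
    "Linux": (6, 12),
    "Darwin": (6, 11),   # macOS; assumes an Intel machine
    "Windows": (6, 8),
}


def is_supported(system, version_info):
    """Return True if the (system, Python version) pair is in the documented range."""
    if system not in SUPPORTED_MINORS or version_info[0] != 3:
        return False
    low, high = SUPPORTED_MINORS[system]
    return low <= version_info[1] <= high


current_ok = is_supported(platform.system(), sys.version_info)
print("documented as supported" if current_ok else "outside the documented range")
```

If the check fails on Windows, WSL is the alternative mentioned above.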

Basic usage

Sample of word segmentation and POS-tagging for Japanese.

import nagisa

text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
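Because `words.words` and `words.postags` are parallel lists, they can be zipped into (word, tag) pairs for downstream processing. The sketch below uses the lists copied from the sample output above so that it runs without nagisa installed; `to_pairs` is a hypothetical helper, not a nagisa function.

```python
def to_pairs(words, postags):
    """Zip parallel word and POS-tag lists into (word, tag) tuples."""
    if len(words) != len(postags):
        raise ValueError("words and postags must have the same length")
    return list(zip(words, postags))


# Copied from the sample output above. With nagisa installed, these would be
# tagged.words and tagged.postags, where tagged = nagisa.tagging(text).
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

pairs = to_pairs(words, postags)
for word, tag in pairs:
    print(f"{word}\t{tag}")
```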

Post-processing functions

Filter and extract words by specific POS tags.

# Filter out the words with the specified POS tags.
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

# Extract only nouns.
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞

# This is a list of available POS-tags in nagisa.
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']
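To make the difference between the two functions concrete, the sketch below reproduces their visible effect on plain lists: filtering drops tokens whose tag is listed, extracting keeps only those tokens. It mirrors the outputs shown above and is not nagisa's internal implementation; `drop_postags` and `keep_postags` are hypothetical names.

```python
def drop_postags(words, postags, filter_postags):
    """Keep (word, tag) pairs whose tag is NOT in filter_postags."""
    excluded = set(filter_postags)
    return [(w, t) for w, t in zip(words, postags) if t not in excluded]


def keep_postags(words, postags, extract_postags):
    """Keep only (word, tag) pairs whose tag IS in extract_postags."""
    wanted = set(extract_postags)
    return [(w, t) for w, t in zip(words, postags) if t in wanted]


# Token and tag lists copied from the basic usage sample.
words = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

filtered = drop_postags(words, postags, ['助詞', '助動詞'])
nouns = keep_postags(words, postags, ['名詞'])
print(' '.join(f"{w}/{t}" for w, t in filtered))
print(' '.join(f"{w}/{t}" for w, t in nouns))
```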

Add a user dictionary in an easy way.

# default
text = "3月に見た「3月のライオン」"
print(nagisa.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

# If a word ("3月のライオン") is included in the single_word_list, it is recognized as a single word.
new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号
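When the user dictionary grows, it may be convenient to keep the entries in a text file, one word per line, and load them into the list passed as `single_word_list`. The following is a minimal sketch under that assumption; the file name `user_words.txt` and the helper `load_single_words` are hypothetical and not provided by nagisa.

```python
import os
import tempfile


def load_single_words(path):
    """Read one word per line, skipping blank lines and duplicates (order kept)."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word and word not in entries:
                entries.append(word)
    return entries


# Demo with a temporary file standing in for a hypothetical user_words.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "user_words.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("3月のライオン\n\n3月のライオン\n")
    single_words = load_single_words(path)

print(single_words)
# With nagisa installed, the list would be passed on as:
# new_tagger = nagisa.Tagger(single_word_list=single_words)
```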

Train a model

Nagisa (v0.2.0+) provides a simple train method for a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.

The train/dev/test files are in TSV format. Each line contains one word and its tag, separated by a tab (word \t tag). Put a line containing only EOS after each sentence. Refer to the sample datasets and the tutorial (Train a model for Universal Dependencies).

$ cat sample.train
唯一	NOUN
の	ADP
趣味	NOUN
は	ADP
料理	NOUN
EOS
とても	ADV
おいしかっ	ADJ
た	AUX
です	AUX
。	PUNCT
EOS
ドル	NOUN
は	ADP
主要	ADJ
通貨	NOUN
EOS

# After training finishes, three model files (*.vocabs, *.params, *.hp) are saved.
nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")

# Build the tagger by loading the trained model files.
sample_tagger = nagisa.Tagger(vocabs='sample.vocabs', params='sample.params', hp='sample.hp')

text = "福岡・博多の観光情報"
words = sample_tagger.tagging(text)
print(words)
#=> 福岡/PROPN ・/SYM 博多/PROPN の/ADP 観光/NOUN 情報/NOUN
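The training file format described above (one word, a tab, and its tag per line, with an EOS line closing each sentence) can be produced and checked with a few lines of standard-library Python. This is a minimal sketch; `write_tagged_file` and `read_tagged_file` are hypothetical helpers and are not part of nagisa.

```python
import os
import tempfile


def write_tagged_file(sentences, path):
    """Write 'word<TAB>tag' lines, with an EOS line after each sentence."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for word, tag in sentence:
                f.write(f"{word}\t{tag}\n")
            f.write("EOS\n")


def read_tagged_file(path):
    """Parse a file in the format above back into lists of (word, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line == "EOS":
                sentences.append(current)
                current = []
            elif line:
                word, tag = line.split("\t")
                current.append((word, tag))
    return sentences


# Two sentences taken from the sample.train listing above.
sentences = [
    [("唯一", "NOUN"), ("の", "ADP"), ("趣味", "NOUN"), ("は", "ADP"), ("料理", "NOUN")],
    [("ドル", "NOUN"), ("は", "ADP"), ("主要", "ADJ"), ("通貨", "NOUN")],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.train")
    write_tagged_file(sentences, path)
    restored = read_tagged_file(path)
    with open(path, encoding="utf-8") as f:
        raw = f.read()

print(raw)
```

A file written this way is what `train_file`, `dev_file`, and `test_file` in `nagisa.fit` are expected to point to. The read-back step is a cheap way to catch a missing tab or EOS before training.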
