Persian NLP Toolkit

Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- Persian
Programming Language
Topic
- Text Processing

Hazm - Persian NLP Toolkit

Evaluation
Introduction
Features
Installation
Pretrained-Models
Usage
Documentation
Hazm in other languages
Contribution
Thanks
- Code contributors
- Others

Evaluation

Module name	Score
DependencyParser	85.6%
POSTagger	98.8%
Chunker	93.4%
Lemmatizer	89.9%

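Module name	Metric	Value
SpacyPOSTagger	Precision	0.99250
SpacyPOSTagger	Recall	0.99249
SpacyPOSTagger	F1-Score	0.99249
EZ Detection in SpacyPOSTagger	Precision	0.99301
EZ Detection in SpacyPOSTagger	Recall	0.99297
EZ Detection in SpacyPOSTagger	F1-Score	0.99298
SpacyChunker	Accuracy	96.53%
SpacyChunker	F-Measure	95.00%
SpacyChunker	Recall	95.17%
SpacyChunker	Precision	94.83%
SpacyDependencyParser	TOK Accuracy	99.06
SpacyDependencyParser	UAS	92.30
SpacyDependencyParser	LAS	89.15
SpacyDependencyParser	SENT Precision	98.84
SpacyDependencyParser	SENT Recall	99.38
SpacyDependencyParser	SENT F-Measure	99.11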
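
The F-Measure figures reported for SpacyChunker and for sentence segmentation (SENT) appear consistent with the usual harmonic mean of precision and recall. The following is a minimal sketch under that assumption; the source does not state how the metrics were computed, and `f_measure` is a name introduced here for illustration only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the evaluation table, in percent.
chunker_f = f_measure(94.83, 95.17)
sent_f = f_measure(98.84, 99.38)

print(f"SpacyChunker F-Measure: {chunker_f:.2f}")  # 95.00
print(f"SENT F-Measure: {sent_f:.2f}")             # 99.11
```

Both values round to the figures listed in the table, which suggests the reported F-Measure is the balanced (beta = 1) variant.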

Introduction

Hazm is a Python library for natural language processing of Persian text. You can use it to normalize text, tokenize sentences and words, lemmatize words, assign part-of-speech tags, identify dependency relations, create word and sentence embeddings, and read popular Persian corpora.

Features

Normalization: Converts text to a standard form, such as removing diacritics, correcting spacing, etc.
Tokenization: Splits text into sentences and words.
Lemmatization: Reduces words to their base forms.
POS tagging: Assigns a part of speech to each word.
Dependency parsing: Identifies the syntactic relations between words.
Embedding: Creates vector representations of words and sentences.
Persian corpora reading: Easily read popular Persian corpora with ready-made scripts and minimal code.

Installation

To install the latest version of Hazm, run the following command in your terminal:

pip install hazm

Alternatively, you can install the latest update from GitHub (this version may be unstable and buggy):

pip install git+https://github.com/roshan-research/hazm.git
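
After installing, it can be useful to confirm which release is active in the current environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for this example and is not part of Hazm.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


hazm_version = installed_version("hazm")
if hazm_version is None:
    print("hazm is not installed in this environment")
else:
    print(f"hazm {hazm_version} is installed")
```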

Pretrained-Models

To use the pretrained models, download them from the links below:

Module name	Size
Download WordEmbedding	~ 5 GB
Download SentEmbedding	~ 1 GB
Download POSTagger	~ 18 MB
Download DependencyParser	~ 15 MB
Download Chunker	~ 4 MB
Download spacy_pos_tagger_parsbertpostagger	~ 630 MB
Download spacy_pos_tagger_parsbertpostagger95	~ 630 MB
Download spacy_chunker_uncased_bert	~ 650 MB
Download spacy_chunker_parsbert	~ 630 MB
Download spacy_dependency_parser	~ 630 MB

Usage

>>> from hazm import *

>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'

>>> sent_tokenize('ما هم برای وصل کردن آمدیم! ولی برای پردازش، جدا بهتر نیست؟')
['ما هم برای وصل کردن آمدیم!', 'ولی برای پردازش، جدا بهتر نیست؟']
>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']

>>> stemmer = Stemmer()
>>> stemmer.stem('کتاب‌ها')
'کتاب'
>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'

>>> tagger = POSTagger(model='pos_tagger.model')
>>> tagger.tag(word_tokenize('ما بسیار کتاب می‌خوانیم'))
[('ما', 'PRO'), ('بسیار', 'ADV'), ('کتاب', 'N'), ('می‌خوانیم', 'V')]

>>> spacy_posTagger = SpacyPOSTagger(model_path = 'MODELPATH')
>>> spacy_posTagger.tag(tokens = ['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN,EZ'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')]

>>> posTagger = POSTagger(model = 'pos_tagger.model', universal_tag = False)
>>> posTagger.tag(tokens = ['من', 'به', 'مدرسه', 'ایران', 'رفته_بودم', '.'])
[('من', 'PRON'), ('به', 'ADP'), ('مدرسه', 'NOUN'), ('ایران', 'NOUN'), ('رفته_بودم', 'VERB'), ('.', 'PUNCT')] 

>>> chunker = Chunker(model='chunker.model')
>>> tagged = tagger.tag(word_tokenize('کتاب خواندن را دوست داریم'))
>>> tree2brackets(chunker.parse(tagged))
'[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'

>>> spacy_chunker = SpacyChunker(model_path = 'model_path')
>>> tree = spacy_chunker.parse(sentence = [('نامه', 'NOUN,EZ'), ('ایشان', 'PRON'), ('را', 'ADP'), ('دریافت', 'NOUN'), ('داشتم', 'VERB'), ('.', 'PUNCT')])
>>> print(tree)
(S
  (NP نامه/NOUN,EZ ایشان/PRON)
  (POSTP را/ADP)
  (VP دریافت/NOUN داشتم/VERB)
  ./PUNCT)

>>> word_embedding = WordEmbedding(model_type = 'fasttext', model_path = 'word2vec.bin')
>>> word_embedding.doesnt_match(['سلام' ,'درود' ,'خداحافظ' ,'پنجره'])
'پنجره'
>>> word_embedding.doesnt_match(['ساعت' ,'پلنگ' ,'شیر'])
'ساعت'

>>> parser = DependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> parser.parse(word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟'))
<DependencyGraph with 8 nodes>

>>> spacy_parser = SpacyDependencyParser(tagger=tagger, lemmatizer=lemmatizer)
>>> spacy_parser.parse_sents([word_tokenize('زنگ‌ها برای که به صدا درمی‌آید؟')])
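
Two of the outputs in the session above are plain strings that usually need a little post-processing: the verb lemma 'رفت#رو' (past stem and present stem joined by '#') and the bracketed chunk string returned by tree2brackets. The sketch below handles both with the standard library only. The helper names `split_verb_lemma` and `parse_brackets` are hypothetical and not part of Hazm, and the bracket format is assumed to be exactly the one shown in the session.

```python
import re
from typing import List, Tuple


def split_verb_lemma(lemma: str) -> Tuple[str, str]:
    """Split a 'past#present' verb lemma into its two stems.

    Lemmas without '#' (non-verbs) are returned in both positions.
    """
    if "#" in lemma:
        past, present = lemma.split("#", 1)
        return past, present
    return lemma, lemma


# One bracketed chunk: the last space-separated token is the chunk label,
# everything before it is the phrase.
_CHUNK = re.compile(r"\[([^\]]+) ([^\s\]]+)\]")


def parse_brackets(bracketed: str) -> List[Tuple[str, str]]:
    """Turn '[phrase LABEL] [phrase LABEL]' into (phrase, label) pairs."""
    return _CHUNK.findall(bracketed)


print(split_verb_lemma('رفت#رو'))
# ('رفت', 'رو')
print(parse_brackets('[کتاب خواندن NP] [را POSTP] [دوست داریم VP]'))
# [('کتاب خواندن', 'NP'), ('را', 'POSTP'), ('دوست داریم', 'VP')]
```

Working on the NLTK tree returned by `chunker.parse` directly is likely more robust than parsing the bracket string; the regular expression here is only meant for quick inspection of already-serialized output.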

Documentation

Visit https://roshan-ai.ir/hazm/docs to view the full documentation.

Hazm in other languages

Disclaimer: These ports are not developed or maintained by Roshan. They may not have the same functionality or quality as the original Hazm.

JHazm: A Java port of Hazm
NHazm: A C# port of Hazm

Contribution

We welcome and appreciate contributions to this repo, such as bug reports, feature requests, code improvements, and documentation updates. Please follow the Contribution guideline when contributing. You can open an issue, or fork the repo, write your code, create a pull request, and wait for review and feedback. Thank you for your interest in and support of this repo!

Thanks

Code contributors

Others

Thanks to the Virastyar project for providing the Persian word list.

Release history

Version	Release date
0.10.0	Jan 16, 2024
0.9.4	Oct 1, 2023
0.9.3	Jul 19, 2023
0.9.2	Jul 8, 2023
0.9.1	Jun 30, 2023
0.7.0	Oct 12, 2018
0.6.0.1	Oct 12, 2018
0.5.2	Oct 7, 2015
0.5.1	Jun 29, 2015
0.5	Mar 20, 2015
0.4	Dec 16, 2014
0.3	Aug 29, 2014
0.2	Jul 11, 2014
0.1	Dec 14, 2013

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hazm-0.10.0.tar.gz (874.3 kB)

Uploaded Jan 16, 2024 Source

Built Distribution

hazm-0.10.0-py3-none-any.whl (892.6 kB)

Uploaded Jan 16, 2024 Python 3

Hashes for hazm-0.10.0.tar.gz

Algorithm	Hash digest
SHA256	`a356543004630a9338cc09f6725a30f9928cdabda9353092bd21585b4329a97d`
MD5	`cb20275bd4a794c4b69b94e1ea77163a`
BLAKE2b-256	`50ed996f77a9c0c49f195a859de3096ee837b7a28f31498221b5d1fd0d00288b`

Hashes for hazm-0.10.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`525c9b32914b98e50dab27fbd4f79c1067c898668812de095b8cdd81cc52b0ef`
MD5	`64e33e12b331466253cb7ddf638959a8`
BLAKE2b-256	`918ccc3d01c27681eb8223781ea162a23f9926647ce864eb601a19aee4bce0af`
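
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using the standard library's `hashlib`; the local path `hazm-0.10.0.tar.gz` is a hypothetical download location, and the helper names `sha256_of` and `matches_digest` are introduced here for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 listed above for hazm-0.10.0.tar.gz.
EXPECTED_SHA256 = "a356543004630a9338cc09f6725a30f9928cdabda9353092bd21585b4329a97d"

archive = Path("hazm-0.10.0.tar.gz")  # hypothetical local download path
if archive.exists():
    print("digest OK" if matches_digest(archive, EXPECTED_SHA256) else "digest MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

pip can perform the same check itself when a requirements file pins hashes and is installed with `--require-hashes`.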