Measuring corpus similarity in Python

corpus_similarity

Measure the similarity between two corpora (text datasets). The measures work best when each corpus is at least 10k words.
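The size guideline above can be checked before running a comparison. The following is a minimal sketch, not part of the package: `count_words` and `is_large_enough` are hypothetical helper names, and counting whitespace-separated tokens is an assumption that will not hold for languages written without spaces (for example zho, jpn, or tha).

```python
def count_words(corpus):
    """Count whitespace-separated tokens across a list of strings."""
    return sum(len(line.split()) for line in corpus)


def is_large_enough(corpus, minimum_words=10_000):
    """Return True if the corpus reaches the suggested minimum size."""
    return count_words(corpus) >= minimum_words


# Toy corpora built by repeating one four-word line.
small = ["a short sample corpus"] * 10      # 40 words
large = ["a short sample corpus"] * 2_500   # 10,000 words

print(is_large_enough(small))  # False
print(is_large_enough(large))  # True
```

A corpus that falls short can still be passed to the package; the check only flags cases where the measure may be less reliable.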

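from corpus_similarity import Similarity

# Only the language code needs to be specified
cs = Similarity(language="eng")

# corpus1 and corpus2 are each a list of strings or a filename (.txt or .gz)
result = cs.calculate(corpus1, corpus2)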

The package includes all preprocessing and training, so only the language needs to be specified. Supported languages are listed below.

Input

The Similarity.calculate method requires two input corpora. Each can be either a list of strings or a filename (.txt and .gz files are supported).
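To show how the two input forms relate, here is a small sketch that turns a .txt or .gz file into a list of strings using only the standard library. The package reads files itself, so this is not needed to call Similarity.calculate; `load_corpus` is a hypothetical helper, and treating each non-empty line as one string is an assumption about the file layout.

```python
import gzip
import tempfile
from pathlib import Path


def load_corpus(source):
    """Return a corpus as a list of strings.

    `source` may already be a list of strings, or a path to a .txt or .gz file.
    """
    if isinstance(source, (list, tuple)):
        return list(source)
    path = Path(source)
    # gzip.open and open both accept text mode with an explicit encoding.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with temporary files holding the same two lines.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = Path(tmp) / "sample.txt"
    gz_path = Path(tmp) / "sample.txt.gz"
    txt_path.write_text("first line\nsecond line\n", encoding="utf-8")
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\n")

    from_txt = load_corpus(txt_path)
    from_gz = load_corpus(gz_path)

print(from_txt == from_gz)
```

Loading a file this way is mainly useful for inspecting or filtering a corpus before handing the resulting list to the package.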

Output

The output is a scalar measure of how similar the two corpora are, ranging from 0 (very different) to 1 (very similar). Values are consistent within a language but not across languages: for example, Swedish has higher relative similarity than Estonian, so scores should only be compared between corpora in the same language.
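Because scores are only comparable within a language, it can help to keep results grouped by language when ranking them. The sketch below is one way to do that; `rank_within_language` is a hypothetical helper, and the scores are invented placeholders for illustration, not real measurements.

```python
from collections import defaultdict


def rank_within_language(results):
    """Group (language, label, score) records by language and sort
    each group from most to least similar."""
    grouped = defaultdict(list)
    for language, label, score in results:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {label!r}: {score}")
        grouped[language].append((label, score))
    return {
        language: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for language, pairs in grouped.items()
    }


# Placeholder scores for illustration only; not real measurements.
results = [
    ("swe", "news vs. tweets", 0.80),
    ("swe", "news vs. web", 0.90),
    ("est", "news vs. web", 0.70),
]

ranking = rank_within_language(results)
for language, pairs in ranking.items():
    print(language, pairs)
```

Rankings are produced separately per language, so a Swedish score is never placed above or below an Estonian one.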

Installation

From PyPI:

pip install corpus_similarity

From GitHub:

pip install git+https://github.com/jonathandunn/corpus_similarity.git

Languages

Pacific Languages

haw, Hawaiian (Polynesian)

mri, te reo Māori (Polynesian)

smo, Samoan (Polynesian)

ton, Tongan (Polynesian)

ceb, Cebuano (Austronesian)

mlg, Malagasy (Austronesian)

msa, Malay (Austronesian)

tgl, Tagalog (Austronesian)

Other Languages

vie, Vietnamese

ind, Indonesian

tam, Tamil

tel, Telugu

bul, Bulgarian

ces, Czech

lav, Latvian

pol, Polish

rus, Russian

slv, Slovenian

ukr, Ukrainian

dan, Danish

deu, German

eng, English

nld, Dutch

nor, Norwegian

swe, Swedish

ell, Greek

fas, Farsi

hin, Hindi

urd, Urdu

cat, Catalan

fra, French

glg, Galician

ita, Italian

por, Portuguese

ron, Romanian

spa, Spanish

jpn, Japanese

kor, Korean

ara, Arabic

heb, Hebrew

zho, Chinese

tha, Thai

tur, Turkish

est, Estonian

fin, Finnish

hun, Hungarian

Release history

1.1, Jan 26, 2024

1.0.1, Jul 31, 2021 (this version)

1.0, Jun 29, 2021
