Measuring corpus similarity in Python

corpus_similarity

Measure the similarity between two corpora (text datasets). The measures work best when each corpus contains at least 10,000 words.

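from corpus_similarity import Similarity

# Only the language code needs to be specified
cs = Similarity(language="eng")

# corpus1 and corpus2 are each a list of strings or a filename (.txt or .gz)
result = cs.calculate(corpus1, corpus2)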

The package handles all preprocessing and training internally, so only the language needs to be specified. The supported languages are listed below.
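Because the measures work best with at least 10k words per corpus, it can help to check corpus size before comparing. The following is a minimal sketch, not part of the package: `count_words` and `is_large_enough` are hypothetical helper names, and counting whitespace-separated tokens is an assumption that does not reflect how the package itself tokenizes. It will undercount for languages written without spaces, such as zho, jpn, and tha.

```python
MIN_WORDS = 10_000  # size guideline taken from the package description


def count_words(corpus):
    """Count whitespace-separated tokens in a list of strings (rough estimate)."""
    return sum(len(text.split()) for text in corpus)


def is_large_enough(corpus, min_words=MIN_WORDS):
    """Return True if the corpus meets the size guideline."""
    return count_words(corpus) >= min_words


# Toy corpora used only to exercise the helpers
small = ["a short document", "another short one"]
large = ["word " * 5_000, "word " * 5_000]

print(count_words(small), is_large_enough(small))
print(count_words(large), is_large_enough(large))
```

A corpus that fails this check can still be passed to `calculate`; the check only flags results that may be less reliable.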

Input

The Similarity.calculate method requires two input corpora. Each can be either a list of strings or a filename (.txt and .gz files are supported).
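The package reads files on its own, so no loader is required. Still, when a corpus needs filtering or sampling first, it can be convenient to read a file into the list-of-strings form. This sketch uses only the standard library; `read_lines` is a hypothetical helper, and treating each non-empty line as one string is an assumption about the file layout, not something the package documents.

```python
import gzip
from pathlib import Path


def read_lines(path):
    """Read a .txt or .gz file into a list of non-empty, stripped lines."""
    path = Path(path)
    # Use gzip for compressed files, the built-in open otherwise
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Hypothetical usage (file names are placeholders):
#   corpus1 = read_lines("corpus1.txt")
#   corpus2 = read_lines("corpus2.gz")
#   result = cs.calculate(corpus1, corpus2)
```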

Output

The output is a scalar measure of how similar the two corpora are, ranging from 0 (very different) to 1 (very similar). Values are comparable within a language but not across languages. For example, Swedish corpora tend to receive higher similarity values than Estonian corpora, so the same value does not mean the same thing in both languages.
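Since values are only comparable within a language, one way to keep results honest is to group them by language before ranking. The sketch below is illustrative: `rank_within_language` is a hypothetical helper, and the numbers are made-up placeholders, not output from the package.

```python
from collections import defaultdict


def rank_within_language(scores):
    """Group (language, pair, value) records by language and sort each
    group from most to least similar. Values are never compared across
    languages."""
    grouped = defaultdict(list)
    for language, pair, value in scores:
        grouped[language].append((pair, value))
    return {
        language: sorted(items, key=lambda item: item[1], reverse=True)
        for language, items in grouped.items()
    }


# Placeholder values for illustration only (not real results)
scores = [
    ("swe", "news vs. web", 0.91),
    ("swe", "news vs. tweets", 0.84),
    ("est", "news vs. web", 0.78),
    ("est", "news vs. tweets", 0.80),
]

ranked = rank_within_language(scores)
for language, items in ranked.items():
    print(language, items)
```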

Installation

To install the release from PyPI:

pip install corpus_similarity

To install directly from the GitHub repository:

pip install git+https://github.com/jonathandunn/corpus_similarity.git

Languages

Pacific Languages

haw, Hawaiian (Polynesian)

mri, te reo Māori (Polynesian)

smo, Samoan (Polynesian)

ton, Tongan (Polynesian)

ceb, Cebuano (Austronesian)

mlg, Malagasy (Austronesian)

msa, Malay (Austronesian)

tgl, Tagalog (Austronesian)

Other Languages

vie, Vietnamese

ind, Indonesian

tam, Tamil

tel, Telugu

bul, Bulgarian

ces, Czech

lav, Latvian

pol, Polish

rus, Russian

slv, Slovenian

ukr, Ukrainian

dan, Danish

deu, German

eng, English

nld, Dutch

nor, Norwegian

swe, Swedish

ell, Greek

fas, Farsi

hin, Hindi

urd, Urdu

cat, Catalan

fra, French

glg, Galician

ita, Italian

por, Portuguese

ron, Romanian

spa, Spanish

jpn, Japanese

kor, Korean

ara, Arabic

heb, Hebrew

zho, Chinese

tha, Thai

tur, Turkish

est, Estonian

fin, Finnish

hun, Hungarian
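The codes listed above can be collected into a set to validate a language code before constructing `Similarity`. This is a sketch under stated assumptions: `SUPPORTED_LANGUAGES` and `check_language` are hypothetical names, the set is copied from this list and may not match other releases, and the package's own behavior for unsupported codes is not described here.

```python
# Language codes copied from the list in this README
SUPPORTED_LANGUAGES = {
    # Pacific languages
    "haw", "mri", "smo", "ton", "ceb", "mlg", "msa", "tgl",
    # Other languages
    "vie", "ind", "tam", "tel",
    "bul", "ces", "lav", "pol", "rus", "slv", "ukr",
    "dan", "deu", "eng", "nld", "nor", "swe",
    "ell", "fas", "hin", "urd",
    "cat", "fra", "glg", "ita", "por", "ron", "spa",
    "jpn", "kor", "ara", "heb",
    "zho", "tha", "tur",
    "est", "fin", "hun",
}


def check_language(code):
    """Return the code unchanged if it is listed, else raise ValueError."""
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code


print(check_language("eng"))
print(len(SUPPORTED_LANGUAGES))
```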

