geoLid

Geographically-informed language identification

This Python package carries out language identification with geographic priors to increase performance for low-resource and under-represented languages.

A description and evaluation of this approach can be found here: https://jdunn.name/2024/03/13/geographically-informed-language-identification/

A complete list of language codes and names per regional model can be found in the language_names directory.

Downloading models

geoLid contains a baseline non-geographic model as well as models for 16 specific regions, as shown below:

baseline (916 languages)
africa_north (44 languages)
africa_southern (58 languages)
africa_sub (166 languages)
america_brazil (88 languages)
america_central (188 languages)
america_north (68 languages)
america_south (129 languages)
asia_central (54 languages)
asia_east (46 languages)
asia_south (60 languages)
asia_southeast (325 languages)
europe_east (65 languages)
europe_russia (65 languages)
europe_west (108 languages)
middle_east (53 languages)
oceania (49 languages)

To download a model, call download_model with the model's name:

from geoLid import download_model
download_model("baseline")

The model name "all" will download all region-specific models.
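As a minimal sketch, the model names listed above can be checked before any download is attempted, so that a typo fails early. The REGION_MODELS table and the download_regions helper are hypothetical and not part of geoLid. The downloader is passed in as an argument, so the sketch assumes nothing about download_model beyond the call shown above. The special name "all" is deliberately not handled here.

```python
# Model names and language counts, copied from the list in this README.
REGION_MODELS = {
    "baseline": 916,
    "africa_north": 44,
    "africa_southern": 58,
    "africa_sub": 166,
    "america_brazil": 88,
    "america_central": 188,
    "america_north": 68,
    "america_south": 129,
    "asia_central": 54,
    "asia_east": 46,
    "asia_south": 60,
    "asia_southeast": 325,
    "europe_east": 65,
    "europe_russia": 65,
    "europe_west": 108,
    "middle_east": 53,
    "oceania": 49,
}


def download_regions(names, downloader):
    """Call downloader(name) once per model name, in order (hypothetical helper).

    `downloader` is expected to be geoLid's download_model; it is injected
    so that this helper does not import geoLid itself.
    """
    # Reject unknown names before downloading anything.
    unknown = [name for name in names if name not in REGION_MODELS]
    if unknown:
        raise ValueError(f"unknown model name(s): {unknown}")
    for name in names:
        downloader(name)
    return list(names)
```

With the package installed, this could be called as download_regions(["baseline", "oceania"], download_model) after importing download_model from geoLid.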

Usage

To identify languages, initialize geoLid and call predict:

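from geoLid import geoLid
# model_location is the directory containing the LID models
lid = geoLid(model_location = "models")
# data is a list of one or more texts (placeholder example)
data = ["This is a sample text."]
labels = lid.predict(data = data, region = "baseline")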

The model_location argument, given at initialization, points to the directory containing the LID models.

The data argument is a list of one or more strings, each of which is a text to be identified.
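Because data is expected to be a list, a single text needs to be wrapped before it is passed in. The helper below is a hypothetical convenience and not part of geoLid. How predict reacts to a bare string or an empty list is not documented here, so this sketch wraps a single string and rejects empty input or non-string items.

```python
def as_text_list(texts):
    """Return `texts` as a non-empty list of strings (hypothetical helper)."""
    # A bare string is treated as one text rather than a sequence of characters.
    if isinstance(texts, str):
        texts = [texts]
    texts = list(texts)
    if not texts:
        raise ValueError("data must contain at least one text")
    for item in texts:
        if not isinstance(item, str):
            raise TypeError(f"expected str, got {type(item).__name__}")
    return texts
```

The result can then be passed as the data argument, for example lid.predict(data = as_text_list("A single sentence."), region = "baseline").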

The region argument selects which region-specific model to use. By default, the non-geographic baseline model is used.
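One way to use the region argument is to pick a regional model when the origin of a text is known and to fall back to the baseline otherwise. The sketch below assumes that predict returns one label per input text, in input order, which this README does not state. The predict_by_region helper, the (region, text) pairs, and the known_regions argument are all hypothetical. The predictor is passed in as a callable so that the grouping logic can be read on its own.

```python
from collections import defaultdict


def predict_by_region(pairs, predict, known_regions):
    """Label (region, text) pairs, batching texts per region (hypothetical helper).

    Texts whose region is missing or unrecognised go to the "baseline" model.
    `predict` is called as predict(data=..., region=...), matching lid.predict,
    and is assumed to return one label per text in input order.
    """
    batches = defaultdict(list)  # region -> list of (original index, text)
    for index, (region, text) in enumerate(pairs):
        if region not in known_regions:
            region = "baseline"
        batches[region].append((index, text))

    # Run one predict call per region, then restore the original order.
    labels = [None] * len(pairs)
    for region, items in batches.items():
        texts = [text for _, text in items]
        for (index, _), label in zip(items, predict(data=texts, region=region)):
            labels[index] = label
    return labels
```

With an initialized lid object, this could be called as predict_by_region(pairs, lid.predict, {"oceania", "asia_east"}), where pairs is a list such as [("oceania", "first text"), (None, "second text")].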

Version: 1.0
