Transliterations to/from Indian languages
indicate: transliterate Indic languages to English
Transliterations to/from Indian languages are generally still of low quality. One problem is access to data. Another is that there is no standard transliteration.
For Hindi–English, we build a novel dataset of names from ESPNcricinfo. For instance, see here for the Hindi version of the English scorecard.
We also create a dataset from election affidavits.
We also exploit the Google Dakshina dataset.
Because there is no single standard way to transliterate, we provide k-best transliterations.
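To make the k-best idea concrete, here is a minimal sketch that keeps the k highest-scoring candidates for one input. It is not the package's implementation: how indicate generates or scores candidates is not described here, and the function name `k_best` and the scores below are hypothetical.

```python
import heapq

def k_best(candidates, k=3):
    """Return the k highest-scoring candidates, best first.

    `candidates` maps a candidate transliteration to a score
    (higher is better). The mapping is a hypothetical stand-in
    for whatever a transliteration model would produce.
    """
    return heapq.nlargest(k, candidates, key=candidates.get)

# Made-up scores for romanizations of a single Hindi name (illustration only).
scores = {"Sachin": 0.62, "Sacin": 0.05, "Sachhin": 0.21, "Satchin": 0.12}
top = k_best(scores, k=2)
print(top)
```

Returning several ranked candidates rather than one lets a downstream matcher accept any of the common spellings of a name.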
Install
We strongly recommend installing indicate inside a Python virtual environment (see venv documentation).
pip install indicate
General API
Examples
Functions
We expose 6 functions, each of which takes either a pandas DataFrame or a CSV file. If the CSV doesn’t have a header, we make some assumptions about where the data is.
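The text does not spell out what those assumptions are. One plausible reading, sketched below with only the standard library, is that a column can be addressed by name when a header exists and by integer position otherwise. The helper `read_column` and its `has_header` flag are hypothetical and are not part of the package.

```python
import csv
import io

def read_column(csv_text, namecol, has_header=True):
    """Return the values of one column from CSV text.

    With a header, `namecol` may be a column name (str) or a position (int).
    Without a header, only an integer position is meaningful.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        header, rows = rows[0], rows[1:]
        idx = header.index(namecol) if isinstance(namecol, str) else namecol
    else:
        if isinstance(namecol, str):
            raise ValueError("a headerless CSV needs an integer column position")
        idx = namecol
    return [row[idx] for row in rows]

with_header = "id,name\n1,smith\n2,zhang\n"
without_header = "1,smith\n2,zhang\n"
print(read_column(with_header, "name"))
print(read_column(without_header, 1, has_header=False))
```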
census_ln(df, namecol, year=2000)
What it does:
Removes extra space
For names in the census file, it appends relevant data of what probability the name provided is of a certain race/ethnicity
Parameters
df : {DataFrame, csv} pandas DataFrame or CSV file containing the names of the individuals whose race/ethnicity is to be inferred
namecol : {string, list, int} name (string or list) or position (int) of the column containing the last name
year : {2000, 2010}, default=2000 year of the census to use
Output: Appends the following columns to the pandas DataFrame or CSV: pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic. See here for what the column names mean.
>>> import pandas as pd
>>> from ethnicolr import census_ln, pred_census_ln
>>> names = [{'name': 'smith'},
...          {'name': 'zhang'},
...          {'name': 'jackson'}]
>>> df = pd.DataFrame(names)
>>> df
      name
0    smith
1    zhang
2  jackson
>>> census_ln(df, 'name')
      name  pctwhite  pctblack  pctapi  pctaian  pct2prace  pcthispanic
0    smith     73.35     22.22    0.40     0.85       1.63         1.56
1    zhang      0.61      0.09   98.16     0.02       0.96         0.16
2  jackson     41.93     53.02    0.31     1.04       2.18         1.53
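The two steps listed under "What it does" (remove extra space, then append lookup columns) can be sketched with plain dictionaries. This is an illustration, not the library's code: `normalize`, `append_census`, and `LOOKUP` are hypothetical names, lowercasing the key is an assumption, and the lookup values are copied from the example output above with only two of the six columns kept.

```python
def normalize(name):
    """Collapse runs of whitespace and strip both ends."""
    return " ".join(name.split())

# Two rows taken from the example output above; the real census file is far larger.
LOOKUP = {
    "smith": {"pctwhite": 73.35, "pctblack": 22.22},
    "zhang": {"pctwhite": 0.61, "pctblack": 0.09},
}

def append_census(records, namecol):
    """Return copies of `records` with the lookup columns appended.

    Names missing from LOOKUP get None in each appended column.
    """
    missing = {"pctwhite": None, "pctblack": None}
    out = []
    for rec in records:
        key = normalize(rec[namecol]).lower()  # lowercasing is an assumption
        out.append({**rec, **LOOKUP.get(key, missing)})
    return out

rows = append_census([{"name": "  smith "}, {"name": "patel"}], "name")
print(rows)
```

The input records are left unmodified, and a name with no match still yields a row, so the output has the same length as the input.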
Data
Evaluation
Contributor Code of Conduct
The project welcomes contributions from everyone! In fact, it depends on it. To maintain this welcoming atmosphere, and to collaborate in a fun and productive way, we expect contributors to the project to abide by the Contributor Code of Conduct.
License
The package is released under the MIT License.