epitran

Tools for transcribing languages into IPA.

A library and tool for transliterating orthographic text as IPA (International Phonetic Alphabet).

Usage

The principal script for transliterating orthographic text as IPA is epitranscribe.py. It takes one argument, the ISO 639-3 code for the language of the orthographic text (given in the examples below together with a script code, e.g. tur-Latn), reads orthographic text from standard input, and writes Unicode IPA to standard output.

$ echo "Düğün olur bayram gelir" | epitranscribe.py "tur-Latn"
dyɰyn oluɾ bajɾam ɟeliɾ
$ epitranscribe.py "tur-Latn" < orthography.txt > phonetic.txt
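
The filter behaviour described above (orthographic text in on standard input, IPA out on standard output) can be sketched in Python. This is not the source of epitranscribe.py: transcribe_stream is a hypothetical helper, and the transliteration function is passed in as a parameter so that the sketch runs without Epitran installed. With Epitran available, one would presumably pass epitran.Epitran('tur-Latn').transliterate instead of the stand-in used here.

```python
import io
import sys


def transcribe_stream(transliterate, infile, outfile):
    """Read lines from infile, convert each one, and write it to outfile.

    `transliterate` is any callable mapping a str to a str. Passing
    epitran.Epitran('tur-Latn').transliterate is an assumption about
    intended use, not the actual code of the command-line script.
    """
    count = 0
    for line in infile:
        # Keep the line structure: drop the newline, convert, re-add it.
        outfile.write(transliterate(line.rstrip("\n")) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # str.upper is only a stand-in so the sketch is runnable on its own.
    demo_input = io.StringIO("abc\ndef\n")
    transcribe_stream(str.upper, demo_input, sys.stdout)
```

Injecting the callable keeps the I/O loop independent of the language being processed, so the same loop could serve any of the supported codes.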

Additionally, the small Python modules epitran and epitran.vector can be used to easily write more sophisticated Python programs for deploying the Epitran mapping tables. This is documented below.

Using the epitran Module

The functionality in the epitran module is encapsulated in the very simple Epitran class. Its constructor takes one argument, code, the ISO 639-3 code of the language to be transliterated plus a hyphen plus a four-letter code for the script (e.g. ‘Latn’ for Latin script, ‘Cyrl’ for Cyrillic script, and ‘Arab’ for a Perso-Arabic script).

>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')
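
The shape of the code argument described above (three-letter language code, hyphen, four-letter script code) can be checked before it is handed to the constructor. The following is a minimal sketch: split_code and CODE_PATTERN are hypothetical names, not part of Epitran, and the library itself may accept or reject codes by different rules.

```python
import re

# Three lowercase letters (ISO 639-3), a hyphen, then a four-letter
# script code with an initial capital (e.g. Latn, Cyrl, Arab).
CODE_PATTERN = re.compile(r"([a-z]{3})-([A-Z][a-z]{3})")


def split_code(code):
    """Return (language, script), or raise ValueError if malformed."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError("expected a code like 'tur-Latn', got %r" % code)
    return match.group(1), match.group(2)


print(split_code("tur-Latn"))
```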

The Epitran class has only a few “public” methods (to the extent that such a concept exists in Python). The most important are transliterate and word_to_tuples:

Epitran.transliterate(text): Convert text (in Unicode-encoded orthography of the language specified in the constructor) to IPA, which is returned.

>>> epi.transliterate(u'Düğün')
u'dy\u0270yn'
>>> print(epi.transliterate(u'Düğün'))
dyɰyn

Epitran.word_to_tuples(word, normpunc=False): Takes a word (a Unicode string) in a supported orthography as input and returns a list of tuples with each tuple corresponding to an IPA segment of the word. The tuples have the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    segments :: List<Tuples>
)

The codes for character_category are the initial characters of the two-character “General Category” codes found in Chapter 4 of the Unicode Standard. For example, “L” corresponds to letters and “P” corresponds to punctuation marks. The above data structure is likely to change in subsequent versions of the library. The structure of segments is as follows:

(
    segment :: Unicode String,
    vector :: List<Integer>
)
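
The one-letter character_category codes mentioned above can be reproduced with the standard library's unicodedata module, which exposes the two-character General Category of a character. Whether Epitran computes the field this way is an assumption; character_category below is a hypothetical helper written only to illustrate the coding scheme.

```python
import unicodedata


def character_category(ch):
    """Return the first letter of the Unicode General Category of ch."""
    # unicodedata.category returns codes such as 'Lu', 'Nd', 'Po', 'Zs'.
    return unicodedata.category(ch)[0]


# A letter, a digit, a punctuation mark, a space, a combining diaeresis.
for ch in ["D", "7", ",", " ", "\u0308"]:
    print(repr(ch), character_category(ch))
```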

Here is an example of an interaction with word_to_tuples:

>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')
>>> epi.word_to_tuples(u'Düğün')
[(u'L', 1, u'D', u'd', [(u'd', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'g\u0306', u'\u0270', [(u'\u0270', [-1, 1, -1, 1, 0, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, 1, -1, 0, -1, 1, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'n', u'n', [(u'n', [-1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])])]
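
To make the nesting easier to follow, the sketch below takes the list printed above (copied as a literal, so Epitran is not needed to run it) and pulls out the individual fields. It also checks two things that the output suggests but the text does not state: joining the phonetic forms gives the same string as transliterate, and the orthographic forms are in decomposed (NFD) Unicode form, e.g. u plus a combining diaeresis rather than a precomposed ü.

```python
import unicodedata

# Feature vector shared by both /y/ segments in the output above.
Y = [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1]

# Result of epi.word_to_tuples(u'Düğün'), copied from the example above.
tuples = [
    ('L', 1, 'D', 'd', [('d', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])]),
    ('L', 0, 'u\u0308', 'y', [('y', Y)]),
    ('L', 0, 'g\u0306', '\u0270', [('\u0270', [-1, 1, -1, 1, 0, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, 1, -1, 0, -1, 1, -1])]),
    ('L', 0, 'u\u0308', 'y', [('y', Y)]),
    ('L', 0, 'n', 'n', [('n', [-1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])]),
]

# Unpack each five-field tuple by position.
ipa = "".join(phon for _cat, _upper, _orth, phon, _segs in tuples)
orth = "".join(orth_form for _cat, _upper, orth_form, _phon, _segs in tuples)
uppercase_flags = [is_upper for _cat, is_upper, _orth, _phon, _segs in tuples]

# Each entry of `segments` is a (segment, vector) pair.
vector_lengths = {len(vec) for t in tuples for _seg, vec in t[4]}

print(ipa)
print(orth == unicodedata.normalize("NFD", "D\u00fc\u011f\u00fcn"))
print(uppercase_flags, vector_lengths)
```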

Using the epitran.vector Module

The epitran.vector module is also very simple. It contains one class, VectorsWithIPASpace, which has one method of interest, word_to_segs.

The constructor for VectorsWithIPASpace takes two arguments:

- code: the language-script code for the language to be processed.
- space: the code for the punctuation/symbol/IPA space in which the characters/segments from the data are expected to reside. The available spaces are listed below.

Its principal method is word_to_segs:

VectorsWithIPASpace.word_to_segs(word, normpunc=False): word is a Unicode string. If the keyword argument normpunc is set to True, punctuation discovered in word is normalized to ASCII equivalents.
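
As a rough illustration of what normalizing punctuation to ASCII equivalents can mean, here is a minimal sketch. The source does not list which characters normpunc handles or how it maps them, so the table PUNC_TO_ASCII and the function normalize_punc are purely hypothetical and are not Epitran's implementation.

```python
# Hypothetical mapping: these entries are illustrative only and are
# not taken from Epitran.
PUNC_TO_ASCII = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
}

_TABLE = str.maketrans(PUNC_TO_ASCII)


def normalize_punc(text):
    """Replace the punctuation listed above with ASCII equivalents."""
    return text.translate(_TABLE)


print(normalize_punc("\u201cdar\u00eb\u201d"))
```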

A typical interaction with the VectorsWithIPASpace object via the word_to_segs method is illustrated here:

>>> import epitran.vector
>>> vwis = epitran.vector.VectorsWithIPASpace('uzb-Latn', 'uzb-with_attached_suffixes-space')
>>> vwis.word_to_segs(u'darë')
[(u'L', 0, u'd', u'd\u032a', u'40', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 0, -1]), (u'L', 0, u'a', u'a', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1]), (u'L', 0, u'r', u'r', u'54', [-1, 1, 1, 1, 0, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, 0, 0, 0, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'46', [-1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, 0, -1, 1, -1, -1, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1])]

(It is important to note that, though the word that serves as input, darë, has four letters, the output contains five tuples because the last letter in darë actually corresponds to two IPA segments, /j/ and /a/.) The returned data structure is a list of tuples, each with the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    in_ipa_punc_space :: Integer,
    phonological_feature_vector :: List<Integer>
)
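
Positional tuples with six fields are easy to misread, so one option is to wrap them in a named tuple. The sketch below does this for the last two tuples of the darë output above, copied as literals. Segment is a hypothetical wrapper, not something Epitran provides. In that output the in_ipa_punc_space values are printed as strings, so the sketch converts them with int().

```python
from collections import namedtuple

# Field names follow the structure given above.
Segment = namedtuple("Segment", [
    "character_category",
    "is_upper",
    "orthographic_form",
    "phonetic_form",
    "in_ipa_punc_space",
    "phonological_feature_vector",
])

# The two tuples produced for the single letter ë in the example above.
raw = [
    ('L', 0, 'e\u0308', 'ja', '46', [-1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, 0, -1, 1, -1, -1, -1, 0, -1]),
    ('L', 0, 'e\u0308', 'ja', '37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1]),
]

segments = [Segment(*t) for t in raw]

for seg in segments:
    # One orthographic letter, two segments, two different vectors.
    print(seg.orthographic_form, seg.phonetic_form,
          int(seg.in_ipa_punc_space), len(seg.phonological_feature_vector))
```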

A few notes are in order regarding this data structure:

- character_category is defined as part of the Unicode Standard (Chapter 4). It consists of a single uppercase letter from the set {‘L’, ‘M’, ‘N’, ‘P’, ‘S’, ‘Z’, ‘C’}. The most frequent of these are ‘L’ (letter), ‘N’ (number), ‘P’ (punctuation), and ‘Z’ (separator, including separating white space).
- is_upper consists only of integers from the set {0, 1}, with 0 indicating lowercase and 1 indicating uppercase.
- The integer in in_ipa_punc_space is an index into a list of known characters/segments such that, barring degenerate cases, each character or segment is assigned a unique and globally consistent number. In cases where a character is encountered which is not in the known space, this field has the value -1. (In the example output above, these indices are printed as strings such as u'40'.)
- The length of the list phonological_feature_vector should be constant for any instantiation of the class (it is based on the number of features defined in panphon) but is, in principle, variable. The integers in this list are drawn from the set {-1, 0, 1}, with -1 corresponding to ‘-’, 0 corresponding to ‘0’, and 1 corresponding to ‘+’. For characters with no IPA equivalent, all values in the list are 0.
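
The correspondence between the integers in phonological_feature_vector and the conventional ‘-’, ‘0’, ‘+’ feature values can be sketched as follows. The helper names vector_to_signs and has_no_ipa_equivalent are hypothetical. The sample vector is the one shown for the first segment of the darë example above.

```python
# Mapping stated in the notes above: -1 is '-', 0 is '0', 1 is '+'.
SIGNS = {-1: "-", 0: "0", 1: "+"}


def vector_to_signs(vector):
    """Render a feature vector as a string of '-', '0', and '+'."""
    return "".join(SIGNS[value] for value in vector)


def has_no_ipa_equivalent(vector):
    """True if every value is 0, as described for non-IPA characters."""
    return len(vector) > 0 and all(value == 0 for value in vector)


# Feature vector of the first tuple in the darë example above.
d_vec = [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 0, -1]

print(vector_to_signs(d_vec))
print(has_no_ipa_equivalent(d_vec))
```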

Language Support

Transliteration Languages

Code	Language (Script)
aze-Cyrl	Azerbaijani (Cyrillic)
aze-Latn	Azerbaijani (Latin)
hau-Latn	Hausa (Latin)
ind-Latn	Indonesian (Latin)
jav-Latn	Javanese (Latin)
kaz-Cyrl	Kazakh (Cyrillic)
kaz-Latn	Kazakh (Latin)
kir-Arab	Kyrgyz (Perso-Arabic)
kir-Cyrl	Kyrgyz (Cyrillic)
kir-Latn	Kyrgyz (Latin)
tuk-Cyrl	Turkmen (Cyrillic)
tuk-Latn	Turkmen (Latin)
tur-Latn	Turkish (Latin)
yor-Latn	Yoruba (Latin)
uig-Arab	Uyghur (Perso-Arabic)
uzb-Cyrl	Uzbek (Cyrillic)
uzb-Latn	Uzbek (Latin)

Language “Spaces”

Code	Language	Note
tur-with_attached_suffixes-space	Turkish	Based on data with suffixes attached
tur-without_attached_suffixes-space	Turkish	Based on data with suffixes removed
uzb-with_attached_suffixes-space	Uzbek	Based on data with suffixes attached
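
The two tables above can be encoded directly to look up which spaces are available for a given language-script code. This is a sketch under one assumption: that a space belongs to a language when the two share the same ISO 639-3 prefix, as in the earlier example pairing 'uzb-Latn' with 'uzb-with_attached_suffixes-space'. The names TRANSLITERATION_CODES, SPACES, and spaces_for are hypothetical, and the data reflects only the tables in this document.

```python
# Codes copied from the "Transliteration Languages" table above.
TRANSLITERATION_CODES = {
    "aze-Cyrl", "aze-Latn", "hau-Latn", "ind-Latn", "jav-Latn",
    "kaz-Cyrl", "kaz-Latn", "kir-Arab", "kir-Cyrl", "kir-Latn",
    "tuk-Cyrl", "tuk-Latn", "tur-Latn", "yor-Latn", "uig-Arab",
    "uzb-Cyrl", "uzb-Latn",
}

# Codes copied from the "Language Spaces" table above.
SPACES = [
    "tur-with_attached_suffixes-space",
    "tur-without_attached_suffixes-space",
    "uzb-with_attached_suffixes-space",
]


def spaces_for(code):
    """List the spaces whose language prefix matches the given code."""
    if code not in TRANSLITERATION_CODES:
        raise ValueError("unsupported language-script code: %r" % code)
    language = code.split("-")[0]
    # Assumption: spaces are matched to languages by ISO 639-3 prefix.
    return [space for space in SPACES if space.split("-")[0] == language]


print(spaces_for("uzb-Latn"))
print(spaces_for("tur-Latn"))
print(spaces_for("kaz-Cyrl"))
```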


This version: 0.1 (released Apr 30, 2016)
