
Thai Language Toolkit


TLTK is a Python package designed for Thai language processing, which includes functionalities such as syllable and word segmentation, discourse unit segmentation, POS tagging, named entity recognition, grapheme-to-phoneme conversion, IPA transcription, romanization, and more. To use TLTK, you will need to have Python 3.6 or a more recent version installed. The project is an open-source software developed at Chulalongkorn University. As of version 1.2.2, the package license has been changed to the New BSD License (BSD-3-Clause).

Input: Thai text encoded in UTF-8.

Updates:

Version 1.8: Two new modules have been introduced. TextClass(text) assesses the difficulty of a text on a four-level scale, L1-L4 (Elementary, Lower Secondary, Middle, High School). txt2feat(text) generates a vector of 130 features, returned as a list of values and derived from the output of the TextAna module. Within these modules, the dependency relations in ‘wrd_deprel[deprel]’ have been transformed into UD format, such as UDsubj, UDobj, UDnmod, and so on. The TextClass module uses these 130 features to evaluate text difficulty.

Version 1.7: Introduced the spoonerism(w) module, which generates one or two spoonerisms from the input word w. This is achieved by swapping the first and last syllables, either a) preserving the initial consonant or b) preserving both the initial consonant and tone. The output is provided as a list of readings in Thai. Additionally, the dependency “sklearn” has been updated to “scikit-learn”.

Version 1.6.8: Bug fixes have been made to the “TextAna” module.

Version 1.6.7: Bug fixes have been made to the “g2p” module.

Version 1.6.6 includes UDParser using MaltParser (https://www.maltparser.org/). To use this feature, please install MaltParser and add a line ‘tltk.nlp.Maltparser_Path = “/path/to/maltparser-1.9.2”’ in your code before using ‘MaltParser’ or ‘MaltParser_wordlist’. The former requires text input while the latter requires a list of words. The UD tree generated by MaltParser is a dictionary with the following format: {‘sentence’: “ข้อความภาษาไทย”, ‘words’: [{‘id’: nn, ‘pos’: POS, ‘deprel’: REL, ‘head’: HD_ID}, {…}, …]}. You can use ‘print_dtree’ to print D-tree from the parsed result. Additionally, ‘delrel’ and ‘SynDepth’ have been added to the properties of ‘TextAna’ when the option ‘UDParse=”Malt”’ is specified. By default, ‘UDParse=”none”’.

Version 1.6.5: This version includes bug fixes in the “SylAna” and “WordAna” modules, as well as a new module called “tltk.corpus.compound(x,y)”.

Version 1.6.3: Bug fixes have been made to the “g2p” module, and some features have been modified in both “WordAna” and “TextAna” modules.

Version 1.6.2: Changes have been made to the text features in this version.

Version 1.6.1: This version includes new text features, an updated Word2Vec model using ‘TNCc5model3.bin’, a change from ‘g2p_all’ to ‘th2ipa_all’, and some bug fixes.

Version 1.6: The new feature in this version is ‘TNC_tag’, which allows you to mark up Thai text in XML format.

Version 1.5.8: This version includes the addition of average reduced frequency in the TextAna module.

Version 1.5.7: The SylAna module has been added, which is included in WordAna. The output is a list of syllable properties, which is added to the word property. Additionally, ‘th2read(text)’ has been added, which shows the pronunciation in Thai written forms.

Version 1.5: This version includes the addition of the WordAna and TextAna modules. The output of WordAna is an object with word properties.

‘res = tltk.nlp.TNC_tag(text,POS)’ returns Thai text in the XML format used in TNC. The POS option can be set to either “Y” or “N”.

sp = tltk.nlp.SylAna(syl_form,syl_phone) => sp.form (syllable form), sp.phone (syllable sound), sp.char (number of characters in the syllable), sp.dead (indicates whether the syllable is dead or live, True/False), sp.initC (initial consonant form), sp.finalC (final consonant form), sp.vowel (vowel form), sp.tonemark (indicates the tone mark, เอก, โท, ตรี, จัตวา), sp.initPh (initial consonant sound), sp.finalPh (final consonant sound), sp.vowelPh (vowel sound), sp.tone (tone 1, 2, 3, 4, or 5), sp.leading (indicates whether the syllable is a leading syllable, True/False), sp.cluster (indicates whether the syllable has an initial cluster, True/False), sp.karan (number of characters marked with a karan marker)

wd = tltk.nlp.WordAna(w) => wd.form (word form), wd.phone (word sound), wd.char (number of characters in the word), wd.syl (number of syllables), wd.corrtone (number of tones that match the same tone marker), wd.corrfinal (number of final consonant sounds that match the final character -ก -ด -ง -น -ม -ย -ว), wd.karan (number of karan markers), wd.cluster (number of cluster consonants), wd.lead (number of leading consonants), wd.doubvowel (number of complex vowels), wd.syl_prop (a list of syllable properties)

res = tltk.nlp.TextAna(text, TextOption, WordOption) => a complex dictionary output describing the input text.

TextOption can be configured with one of the following values: “segmented,” “edu,” or “par.” To segment the text with <p>, <s>, and | representing a new paragraph, space, and word segmentation, select “segmented.” To apply TLTK EDU segmentation, choose “edu.” To process the text as plain text format using “\n” for paragraph separation, use “par.”

WordOption can be set to “colloc” or “mm”. If the text is not yet segmented, use “colloc” or “mm” to segment the text into words using TLTK.

### properties from SylAna

  • form: syllable form

  • phone: syllable sound

  • char: number of characters in the syllable

  • dead: True|False (indicates whether the syllable is dead or live)

  • initC: initial consonant

  • finalC: final consonant

  • vowel: vowel form

  • tonemark: tone marker (values: 1, 2, 3, 4, 5)

  • initPh: initial sound

  • finalPh: final sound

  • vowelPh: vowel sound

  • tone: tone (values: 1, 2, 3, 4, 5)

  • leading: True|False (indicates whether the syllable is a leading syllable, e.g., in สบาย, สห)

  • cluster: True|False (indicates whether the syllable has a cluster consonant)

  • karan: character(s) marked with karan

### properties from WordAna

  • form: word form

  • phone: word sound

  • char: number of characters

  • syl: number of syllables

  • corrtone: number of correct tone markers (สามัญ, ่ เอก, ้ โท, ๊ ตรี, ๋ จัตวา) in both form and sound

  • incorrtone: number of incorrect tone markers in both form and sound

  • corrfinal: number of correct final consonants (-ก -ด -ง -น -ม -ย -ว)

  • incorrfinal: number of incorrect final consonants (excluding -ก -ด -ง -น -ม -ย -ว)

  • karan: number of karan markers

  • cluster: number of cluster consonants

  • lead: number of leading consonants

  • doubvowel: number of double vowels

### properties from TextAna

  • DesSpC: No. of spaces in a text

  • DesChaC: No. of characters in a text

  • DesSymbC: No. of symbols or special characters in a text

  • DesPC: No. of paragraphs

  • DesEduC: No. of edu units

  • DesTotW: Total number of words in a text

  • DesTotT: Total number of unique words (types) in a text

  • DesEduL: Mean length of an edu unit (in words)

  • DesEduLd: Standard deviation of edu length (in words)

  • DesWrdL: Mean length of a word (in syllables)

  • DesWrdLd: Standard deviation of word length (in syllables)

  • DesPL: Mean length of a paragraph (in words)

  • DesCorrToneC: Number of words with the correct tone form and tone sound

  • DesInCorrToneC: Number of words with incorrect tone form and/or tone sound

  • DesCorrFinalC: Number of words with correct final consonant (-ก -ด -ง -น -ม -ย -ว)

  • DesInCorrFinalC: Number of words with incorrect final consonant (not -ก -ด -ง -น -ม -ย -ว)

  • DesClusterC: Number of words with a consonant cluster

  • DesLeadC: Number of words with a leading syllable (e.g. สบาย, สห)

  • DesDoubVowelC: Number of words with a double vowel

  • DesTNCt1C: No. of words in TNC tier1 50%

  • DesTNCt2C: No. of words in TNC tier2 51-60%

  • DesTNCt3C: No. of words in TNC tier3 61-70%

  • DesTNCt4C: No. of words in TNC tier4 71-80%

  • DesTTC1: No. of words in TTC level1

  • DesTTC2: No. of words in TTC level2

  • DesTTC3: No. of words in TTC level3

  • DesTTC4: No. of words in TTC level4

  • WrdCorrTone: ratio of words with the same tone form and phone

  • WrdInCorrTone: ratio of words with different tone form and phone

  • WrdCorrFinal: ratio of words with correct final consonant -ก -ด -ง -น -ม -ย -ว

  • WrdInCorrFinal: ratio of words with final consonant not -ก -ด -ง -น -ม -ย -ว

  • WrdKaran: ratio of words with a karan

  • WrdCluster: ratio of words with a cluster

  • WrdLead: ratio of words with a leading syllable

  • WrdDoubVowel: ratio of words with a double vowel

  • WrdNEl: ratio of named entity locations

  • WrdNEo: ratio of named entity organizations

  • WrdNEp: ratio of named entity persons

  • WrdNeg: ratio of negations

  • WrdTNCt1: relative frequency of words in TNC tier 1 (/1000 words)

  • WrdTNCt2: relative frequency of words in TNC tier 2

  • WrdTNCt3: relative frequency of words in TNC tier 3

  • WrdTNCt4: relative frequency of words in TNC tier 4

  • WrdTTC1: relative frequency of words in TTC level 1

  • WrdTTC2: relative frequency of words in TTC level 2

  • WrdTTC3: relative frequency of words in TTC level 3

  • WrdTTC4: relative frequency of words in TTC level 4

  • WrdC: mean of relative frequency of content words in TTC

  • WrdF: mean of relative frequency of function words in TTC

  • WrdCF: mean of relative frequency of content/function words in TTC

  • WrdFrmSing: mean of relative frequency of single-word forms in TTC

  • WrdFrmComp: mean of relative frequency of complex/compound word forms in TTC

  • WrdFrmTran: mean of relative frequency of transliterated words in TTC

  • WrdSemSimp: mean of relative frequency of simple words in TTC

  • WrdSemTran: mean of relative frequency of transparent compound words in TTC

  • WrdSemSemi: mean of relative frequency of words in between transparent and opaque compound words in TTC

  • WrdSemOpaq: mean of relative frequency of opaque compound words in TTC

  • WrdBaseM: mean of relative frequency of basic vocab from Ministry of Education

  • WrdBaseT: mean of relative frequency of basic vocab from TTC & TNC < 2000

  • WrdTfidf: average of TF-IDF of each word (calculated from TNC)

  • WrdTncDisp: average of dispersion of each word (calculated from TNC)

  • WrdTtcDisp: average of dispersion of each word (calculated from TTC)

  • WrdArf: average of ARF (average reduced frequency) of each word in the text

  • WrdNOUN: mean of relative frequency of words with POS=NOUN

  • WrdVERB: mean of relative frequency of words with POS=VERB

  • WrdADV: mean of relative frequency of words with POS=ADV

  • WrdDET: mean of relative frequency of words with POS=DET

  • WrdADJ: mean of relative frequency of words with POS=ADJ

  • WrdADP: mean of relative frequency of words with POS=ADP

  • WrdPUNCT: mean of relative frequency of words with POS=PUNCT

  • WrdAUX: mean of relative frequency of words with POS=AUX

  • WrdSYM: mean of relative frequency of words with POS=SYM

  • WrdINTJ: mean of relative frequency of words with POS=INTJ

  • WrdCCONJ: mean of relative frequency of words with POS=CCONJ

  • WrdPROPN: mean of relative frequency of words with POS=PROPN

  • WrdNUM: mean of relative frequency of words with POS=NUM

  • WrdPART: mean of relative frequency of words with POS=PART

  • WrdPRON: mean of relative frequency of words with POS=PRON

  • WrdSCONJ: mean of relative frequency of words with POS=SCONJ

  • LdvTTR: type-token ratio, which is the ratio of the number of unique words (types) to the total number of words (tokens) in a text

  • CrfCNL: proportion of utterances having the same NOUN overlapped locally (yes or no)

  • CrfCVL: proportion of utterances having the same VERB overlapped locally (yes or no)

  • CrfCWL: proportion of utterances having the same content words overlapped locally (yes or no)

  • CrfCTL: proportion of utterances having content words overlapped locally (measured by the number of overlapping tokens)

  • wrd: dictionary where wrd[word] = freq, representing the frequency of each word in a text

  • wrd_arf: dictionary where wrd_arf[word] = arf, representing the average reduced frequency of each word in a text

  • wrd_deprel: dictionary where wrd_deprel[deprel] = freq, representing the frequency of each dependency relation (deprel) in a text
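The relationship between the ‘wrd’ dictionary and the type-token ratio can be sketched in a few lines. This is only an illustration that assumes wrd[word] counts every token of the text; the dictionary below is a hand-made stand-in rather than real TextAna output, and the helper name type_token_ratio is hypothetical.

```python
def type_token_ratio(wrd):
    """Return (tokens, types, ttr) for a {word: frequency} dictionary."""
    tokens = sum(wrd.values())   # total number of words (tokens)
    types = len(wrd)             # number of unique words (types)
    ttr = types / tokens if tokens else 0.0
    return tokens, types, ttr


# Hand-made stand-in for the 'wrd' entry; not real TextAna output.
wrd = {"รัง": 3, "ที่": 2, "นก": 1, "บิน": 1, "ไม่": 3}
tokens, types, ttr = type_token_ratio(wrd)
print(tokens, types, ttr)
```

With real output, the same helper could presumably be applied to the ‘wrd’ entry of the dictionary returned by TextAna.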

Version 1.4 has been updated for gensim 4.0. Users can load a Thai corpus using Corpus(), then create a model using W2V_train() or D2V_train(), or load an existing model from W2V_load(Model_File). The pre-trained w2v model for TNC is TNCc5model2.bin. The model for EDU segmentation has been recompiled to work with the new library.

Version 1.3.8 has added spell_variants to generate all variation forms of the same pronunciation.

Version 1.3.6 has removed the “matplotlib” dependency and fixed an error with “ใคร”.

More compound words have been added to the dictionary. Versions 1.1.3-1.1.5 contained many entries that were not words and had a few errors. Those entries have been removed in later versions.

The NER tagger model has been updated by using more named entity data from the AiforThai project.

tltk.nlp : basic tools for Thai language processing.

>tltk.nlp.TextClass(text): The defaults are TextOption=”par”, WordOption=”colloc”, UDParse=”Malt”, and Classifier=”level”. If the text is already word-segmented with “|”, use WordOption=”segmented”.

>tltk.nlp.txt2feat(text, Option=”name|value”): Returns a list of 130 feature values analyzed from the text. If Option=”name”, returns a list of 130 feature names.

>tltk.nlp.spoonerism(word_or_phrase): Returns one or two “spoonerisms” derived from the input, e.g. tltk.nlp.spoonerism(‘แขนเป็นฟอ’)

=>[‘คอ-เป็น-แฝน’, ‘ขอ-เป็น-แฟน’]

>tltk.nlp.TextAna(Text, UDParse=”Malt”): This call analyzes plain text by paragraph, segments words using the colloc approach, and uses MaltParser for UD parsing. The default options are TextOption=”par”, WordOption=”colloc”, and UDParse=”none”. If the input is already segmented with ‘|’, use TextOption=”segmented” and WordOption=”segmented”. To process the text by EDU instead, set TextOption=”edu”.

=>output as a dict of text features described in TextAna

>tltk.nlp.TextAna2json(Text, Filename, Options) functions similarly to the above, but the results are saved to a JSON file. The Options parameter includes a Mode which can be set to “write” or “append”.

>tltk.nlp.MaltParser(Text) e.g. print_dtree(tltk.nlp.MaltParser(“เขานั่งดูหนังอยู่ที่บ้าน”))

=>

  • 1:—-เขา (PRON, nsubj - 2)

  • 2:–นั่ง (VERB, root - 0)

  • 3:—-ดู (VERB, compound - 2)

  • 4:——หนัง (NOUN, obj - 3)

  • 5:——อยู่ (VERB, compound - 3)

  • 6:———-ที่ (ADP, case - 7)

  • 7:——–บ้าน (NOUN, obl - 5)
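The parse result is described earlier as a dictionary with ‘sentence’ and ‘words’ keys, so the tree can be walked without any extra library. The sketch below computes how far each word is from the root. It assumes integer ids, a root whose head is 0, and a well-formed tree without cycles; the dictionary is transcribed by hand from the printed tree above, and node_depths is a hypothetical helper, not part of TLTK.

```python
def node_depths(tree):
    """Map each word id to its distance from the root (root depth = 0)."""
    head = {w["id"]: w["head"] for w in tree["words"]}
    depths = {}
    for wid in head:
        depth, cur = 0, wid
        # Follow head links until the word attached to id 0 (the root) is reached.
        while head[cur] != 0:
            cur = head[cur]
            depth += 1
        depths[wid] = depth
    return depths


# Transcribed by hand from the printed tree; key names follow the documented format.
tree = {
    "sentence": "เขานั่งดูหนังอยู่ที่บ้าน",
    "words": [
        {"id": 1, "pos": "PRON", "deprel": "nsubj", "head": 2},
        {"id": 2, "pos": "VERB", "deprel": "root", "head": 0},
        {"id": 3, "pos": "VERB", "deprel": "compound", "head": 2},
        {"id": 4, "pos": "NOUN", "deprel": "obj", "head": 3},
        {"id": 5, "pos": "VERB", "deprel": "compound", "head": 3},
        {"id": 6, "pos": "ADP", "deprel": "case", "head": 7},
        {"id": 7, "pos": "NOUN", "deprel": "obl", "head": 5},
    ],
}
depths = node_depths(tree)
print(depths)
print(max(depths.values()))
```

The deepest word here is ที่ (id 6), which matches its having the longest indentation in the printed tree. Whether TLTK's ‘SynDepth’ property is defined the same way is not stated in this description.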

>tltk.nlp.TNC_tag(Text,POSTagOption) e.g. tltk.nlp.TNC_tag(‘นายกรัฐมนตรีกล่าวกับคนขับรถประจำทางหลวงสายสองว่า อยากวิงวอนให้ใช้ความรอบคอบ’,POS=’Y’)

=> ‘<w tran=”naa0jok3rat3tha1mon0trii0” POS=”NOUN”>นายกรัฐมนตรี</w><w tran=”klaaw1” POS=”VERB”>กล่าว</w><w tran=”kap1” POS=”ADP”>กับ</w><w tran=”khon0khap1rot3” POS=”NOUN”>คนขับรถ</w><w tran=”pra1cam0” POS=”NOUN”>ประจำ</w><w tran=”thaaN0luuaN4” POS=”NOUN”>ทางหลวง</w><w tran=”saaj4” POS=”NOUN”>สาย</w><w tran=”sOON4” POS=”NUM”>สอง</w><w tran=”waa2” POS=”SCONJ”>ว่า</w><s/><w tran=”jaak1” POS=”VERB”>อยาก</w><w tran=”wiN0wOOn0” POS=”VERB”>วิงวอน</w><w tran=”haj2” POS=”SCONJ”>ให้</w><w tran=”chaj3” POS=”VERB”>ใช้</w><w tran=”khwaam0” POS=”NOUN”>ความ</w><w tran=”rOOp2khOOp2” POS=”VERB”>รอบคอบ</w><s/>’
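Because the result is a flat sequence of <w> elements without a single root, one way to read it back is to wrap it in a temporary root element and parse it with the standard library. This is a sketch under the assumption that the real output uses plain ASCII double quotes around attribute values (the curly quotes above look like a rendering artifact); parse_tnc_xml is a hypothetical helper, and the sample is a shortened fragment in the same format.

```python
import xml.etree.ElementTree as ET


def parse_tnc_xml(fragment):
    """Return (word, transcription, POS) triples from a TNC-style fragment."""
    # The fragment has no single root element, so wrap it before parsing.
    root = ET.fromstring("<root>" + fragment + "</root>")
    return [(w.text, w.get("tran"), w.get("POS")) for w in root.iter("w")]


# Shortened fragment in the format shown above.
sample = '<w tran="klaaw1" POS="VERB">กล่าว</w><w tran="kap1" POS="ADP">กับ</w><s/>'
rows = parse_tnc_xml(sample)
for row in rows:
    print(row)
```

The empty <s/> space markers are skipped because only <w> elements are collected.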

>tltk.nlp.chunk(Text) : chunk parsing. The output includes markups for word segments (|), elementary discourse units (<u/>), pos tags (/POS),and named entities (<NEx>…</NEx>), e.g. tltk.nlp.chunk(“สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์”)

=> ‘<NEo>สำนักงาน/NOUN|เขต/NOUN|จตุจักร/PROPN|</NEo>ชี้แจง/VERB|ว่า/SCONJ|<s/>/PUNCT|ได้/AUX|นำ/VERB|ป้ายประกาศ/NOUN|เตือน/VERB|ปลิง/NOUN|ไป/VERB|ปัก/VERB|ตาม/ADP|แหล่งน้ำ/NOUN|<u/>ใน/ADP|<NEl>เขต/NOUN|อำเภอ/NOUN|เมือง/NOUN|<s/>/PUNCT|จังหวัด/NOUN|อ่างทอง/PROPN|</NEl><u/>หลังจาก/SCONJ|<NEp>นาย/NOUN|สุ/PROPN|กิจ/NOUN|</NEp><s/>/PUNCT|อายุ/NOUN|<u/>65/NUM|<s/>/PUNCT|ปี/NOUN|<u/>ถูก/AUX|ปลิง/VERB|กัด/VERB|แล้ว/ADV|ไม่ได้/AUX|ไป/VERB|พบ/VERB|แพทย์/NOUN|<u/>’

>tltk.nlp.segment(Text) : segment edu by marking <u/> e.g. tltk.nlp.segment(“แต่อาจเพราะนกกินปลีอกเหลืองเป็นพ่อแม่มือใหม่ รังที่ทำจึงไม่ค่อยแข็งแรง วันหนึ่งรังก็ฉีกเกือบขาดเป็นสองท่อนห้อยต่องแต่ง ผมพยายามหาอุปกรณ์มายึดรังกลับคืนรูปทรงเดิม ขณะที่แม่นกกินปลีอกเหลืองส่งเสียงโวยวายอยู่ใกล้ ๆ แต่สุดท้ายไม่สำเร็จ สองสามวันต่อมารังที่ช่วยซ่อมก็พังไป ไม่เห็นแม่นกบินกลับมาอีกเลย”)

=>”แต่|อาจ|เพราะ|นกกินปลีอกเหลือง|เป็น|พ่อแม่|มือใหม่|<s/>|รัง|ที่|ทำ|จึง|ไม่ค่อย|แข็งแรง<u/>วัน|หนึ่ง|รัง|ก็|ฉีก|เกือบ|ขาด|เป็น|สอง|ท่อน|ห้อย|ต่องแต่ง<u/>ผม|พยายาม|หา|อุปกรณ์|มา|ยึด|รัง|กลับคืน|รูปทรง|เดิม<u/>ขณะที่|แม่|นกกินปลีอกเหลือง|ส่งเสียง|โวยวาย|อยู่|ใกล้|ๆ|<s/><u/>แต่|สุดท้าย|ไม่|สำเร็จ<u/>สอง|สาม|วัน|ต่อมา|รัง|ที่|ช่วย|ซ่อม|ก็|พัง|ไป<u/>ไม่|เห็น|แม่|นก|บิน|กลับ|มา|อีก|เลย<u/>”

>tltk.nlp.ner_tag(Text) : The output includes markups for named entities (<NEx>…</NEx>), e.g. tltk.nlp.ner_tag(“สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์”)

=> ‘<NEo>สำนักงานเขตจตุจักร</NEo>ชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ใน<NEl>เขตอำเภอเมือง จังหวัดอ่างทอง</NEl> หลังจาก<NEp>นายสุกิจ</NEp> อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์’

>tltk.nlp.ner([(w,pos),….]) : module for named entity recognition (person, organization, location), e.g. tltk.nlp.ner([(‘สำนักงาน’, ‘NOUN’), (‘เขต’, ‘NOUN’), (‘จตุจักร’, ‘PROPN’), (‘ชี้แจง’, ‘VERB’), (‘ว่า’, ‘SCONJ’), (’<s/>’, ‘PUNCT’)])

=> [(‘สำนักงาน’, ‘NOUN’, ‘B-O’), (‘เขต’, ‘NOUN’, ‘I-O’), (‘จตุจักร’, ‘PROPN’, ‘I-O’), (‘ชี้แจง’, ‘VERB’, ‘O’), (‘ว่า’, ‘SCONJ’, ‘O’), (’<s/>’, ‘PUNCT’, ‘O’)] Named entity recognition is based on the CRF model adapted from the http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html tutorial. The model was trained on a corpus containing 170,000 named entities. The tags used for organizations are B-O and I-O, for persons are B-P and I-P, and for locations are B-L and I-L.
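The B-/I- tags can be merged back into whole entities with a small loop. The following is a minimal sketch, not TLTK code: bio_to_entities is a hypothetical helper that takes the (word, POS, tag) triples shown above and joins consecutive B-x/I-x words without spaces, which suits Thai script.

```python
def bio_to_entities(tagged):
    """Group (word, pos, tag) triples into (entity_text, entity_type) pairs."""
    entities, words, etype = [], [], None
    for word, _pos, tag in tagged:
        if tag.startswith("B-"):
            # A new entity begins; close any entity still open.
            if words:
                entities.append(("".join(words), etype))
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and words and tag[2:] == etype:
            words.append(word)
        else:
            # Tag 'O' (outside) or an inconsistent I- tag ends the entity.
            if words:
                entities.append(("".join(words), etype))
            words, etype = [], None
    if words:
        entities.append(("".join(words), etype))
    return entities


tagged = [
    ("สำนักงาน", "NOUN", "B-O"), ("เขต", "NOUN", "I-O"), ("จตุจักร", "PROPN", "I-O"),
    ("ชี้แจง", "VERB", "O"), ("ว่า", "SCONJ", "O"), ("<s/>", "PUNCT", "O"),
]
entities = bio_to_entities(tagged)
print(entities)
```

Note that the entity type "O" returned for this example comes from the B-O/I-O tags and stands for organization, whereas the bare tag ‘O’ marks words outside any entity.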

>tltk.nlp.pos_tag(Text,WordSegmentOption) : word segmentation and POS tagging (using nltk.tag.perceptron), e.g. tltk.nlp.pos_tag(‘โปรแกรมสำหรับใส่แท็กหมวดคำภาษาไทย วันนี้ใช้งานได้บ้างแล้ว’)

=> [[(‘โปรแกรม’, ‘NOUN’), (‘สำหรับ’, ‘ADP’), (‘ใส่’, ‘VERB’), (‘แท็ก’, ‘NOUN’), (‘หมวดคำ’, ‘NOUN’), (‘ภาษาไทย’, ‘PROPN’), (’<s/>’, ‘PUNCT’)], [(‘วันนี้’, ‘NOUN’), (‘ใช้งาน’, ‘VERB’), (‘ได้’, ‘ADV’), (‘บ้าง’, ‘ADV’), (‘แล้ว’, ‘ADV’), (’<s/>’, ‘PUNCT’)]]

The default word segmentation method used is “colloc” in the function word_segment(Text, “colloc”), but if the option is set to “mm”, then the function word_segment(Text, “mm”) will be used. The POS tag set used is based on the Universal POS tag set found at http://universaldependencies.org/u/pos/index.html. The nltk.tag.perceptron model is used for POS tagging, which was trained on a POS-tagged subcorpus in TNC consisting of 148,000 words.


>tltk.nlp.pos_tag_wordlist(WordLst) : Same as “tltk.nlp.pos_tag”, but the input is a word list, [w1,w2,…]

>tltk.nlp.segment(Text) : segment a paragraph into elementary discourse units (edu) marked with <u/> and segment words in each edu e.g. tltk.nlp.segment(“แต่อาจเพราะนกกินปลีอกเหลืองเป็นพ่อแม่มือใหม่ รังที่ทำจึงไม่ค่อยแข็งแรง วันหนึ่งรังก็ฉีกเกือบขาดเป็นสองท่อนห้อยต่องแต่ง ผมพยายามหาอุปกรณ์มายึดรังกลับคืนรูปทรงเดิม ขณะที่แม่นกกินปลีอกเหลืองส่งเสียงโวยวายอยู่ใกล้ ๆ แต่สุดท้ายไม่สำเร็จ สองสามวันต่อมารังที่ช่วยซ่อมก็พังไป ไม่เห็นแม่นกบินกลับมาอีกเลย”)

=> ‘แต่|อาจ|เพราะ|นกกินปลีอกเหลือง|เป็น|พ่อแม่|มือใหม่|<s/>|รัง|ที่|ทำ|จึง|ไม่|ค่อย|แข็งแรง<u/>วัน|หนึ่ง|รัง|ก็|ฉีก|เกือบ|ขาด|เป็น|สอง|ท่อน|ห้อย|ต่องแต่ง<u/>ผม|พยายาม|หา|อุปกรณ์|มา|ยึด|รัง|กลับคืน|รูปทรง|เดิม<u/>ขณะ|ที่|แม่|นกกินปลีอกเหลือง|ส่งเสียง|โวยวาย|อยู่|ใกล้|ๆ<u/>แต่|สุดท้าย|ไม่|สำเร็จ|<s/>|สอง|สาม|วัน|ต่อ|มา|รัง|ที่|ช่วย|ซ่อม|ก็|พัง|ไป<u/>ไม่|เห็น|แม่|นก|บิน|กลับ|มา|อีก|เลย<u/>’ edu segmentation is based on syllable input using RandomForestClassifier model, which is trained on an edu-segmented corpus (approx. 7,000 edus) created and used in Nalinee's thesis
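Since the output is a single string, it usually needs to be split before further processing. A minimal sketch, assuming only the two markers described above (<u/> closes an EDU, | separates words); split_edus is a hypothetical helper and the sample is a shortened string in the same format.

```python
def split_edus(segmented):
    """Split segment()-style output into a list of EDUs, each a list of words."""
    edus = []
    for chunk in segmented.split("<u/>"):
        words = [w for w in chunk.split("|") if w]
        if words:  # skip the empty piece after the final <u/>
            edus.append(words)
    return edus


# Shortened string in the format shown above.
sample = "แต่|สุดท้าย|ไม่|สำเร็จ<u/>ไม่|เห็น|แม่|นก|บิน|กลับ|มา|อีก|เลย<u/>"
edus = split_edus(sample)
print(len(edus))
print([len(e) for e in edus])
```

Space markers (<s/>) are kept as ordinary tokens here; filter them out if only words are wanted.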

>tltk.nlp.word_segment(Text,method=’mm|ngram|colloc’) : word segmentation using either maximum matching or ngram or maximum collocation approach. ‘colloc’ is used by default. Please note that the first run of ngram method would take a long time because TNC.3g will be loaded for ngram calculation. e.g.

>tltk.nlp.word_segment(‘ผู้สื่อข่าวรายงานว่านายกรัฐมนตรีไม่มาทำงานที่ทำเนียบรัฐบาล’) => ‘ผู้สื่อข่าว|รายงาน|ว่า|นายกรัฐมนตรี|ไม่|มา|ทำงาน|ที่|ทำเนียบรัฐบาล|<s/>’

>tltk.nlp.syl_segment(Text) : syllable segmentation using 3gram statistics e.g. tltk.nlp.syl_segment(‘โปรแกรมสำหรับประมวลผลภาษาไทย’)

=> ‘โปร~แกรม~สำ~หรับ~ประ~มวล~ผล~ภา~ษา~ไทย<s/>’

>tltk.nlp.word_segment_nbest(Text, N) : returns the best N segmentations under a minimum-word assumption, e.g. tltk.nlp.word_segment_nbest(‘คนขับรถประจำทางปรับอากาศ’,10)

=> [[‘คนขับ|รถประจำทาง|ปรับอากาศ’, ‘คนขับรถ|ประจำทาง|ปรับอากาศ’, ‘คน|ขับ|รถประจำทาง|ปรับอากาศ’, ‘คน|ขับรถ|ประจำทาง|ปรับอากาศ’, ‘คนขับ|รถ|ประจำทาง|ปรับอากาศ’, ‘คนขับรถ|ประจำ|ทาง|ปรับอากาศ’, ‘คนขับ|รถประจำทาง|ปรับ|อากาศ’, ‘คนขับรถ|ประจำทาง|ปรับ|อากาศ’, ‘คน|ขับ|รถ|ประจำทาง|ปรับอากาศ’, ‘คนขับ|ร|ถ|ประจำทาง|ปรับอากาศ’]]
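The ordering in the list above is consistent with ranking candidates by how few words they contain. The sketch below illustrates that reading only; how TLTK actually scores candidates is not stated here, and word_count is a hypothetical helper applied to three of the candidates above.

```python
def word_count(segmentation):
    """Number of words in a '|'-delimited segmentation."""
    return len(segmentation.split("|"))


# Three of the candidates listed above, deliberately out of order.
candidates = [
    "คน|ขับ|รถประจำทาง|ปรับอากาศ",
    "คนขับ|รถประจำทาง|ปรับอากาศ",
    "คน|ขับ|รถ|ประจำทาง|ปรับอากาศ",
]
ranked = sorted(candidates, key=word_count)
for seg in ranked:
    print(word_count(seg), seg)
```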

>tltk.nlp.g2p(Text) : return Word segments and pronunciations e.g. tltk.nlp.g2p(“สถาบันอุดมศึกษาไม่สามารถก้าวให้ทันการเปลี่ยนแปลงของตลาดแรงงาน”)

=> “สถา~บัน~อุ~ดม~ศึก~ษา|ไม่|สา~มารถ|ก้าว|ให้|ทัน|การ|เปลี่ยน~แปลง|ของ|ตลาด~แรง~งาน<tr/>sa1’thaa4~ban0~?u1~dom0~sUk1~saa4|maj2|saa4~maat2|kaaw2|haj2|than0|kaan0|pliian1~plxxN0|khOON4|ta1’laat1~rxxN0~Naan0|<s/>”
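The written form and the transcription are joined by <tr/>, so they can be paired word by word. This is a sketch that assumes both halves contain the same number of |-separated words once <s/> markers are dropped; align_g2p is a hypothetical helper and the sample is a shortened string in the same format.

```python
def align_g2p(output):
    """Pair each word with its transcription from a 'words<tr/>phones' string."""
    written, _, spoken = output.partition("<tr/>")
    # Drop empty pieces and space markers on both sides before pairing.
    words = [w for w in written.split("|") if w and w != "<s/>"]
    phones = [p for p in spoken.split("|") if p and p != "<s/>"]
    return list(zip(words, phones))


# Shortened string in the format shown above.
sample = "ไม่|สา~มารถ|ก้าว<tr/>maj2|saa4~maat2|kaaw2|<s/>"
pairs = align_g2p(sample)
for word, phone in pairs:
    print(word, phone)
```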

>tltk.nlp.th2ipa(Text) : return Thai transcription in IPA forms e.g. tltk.nlp.th2ipa(“ลงแม่น้ำรอเดินไปหาปลา”)

=> ‘loŋ1 mɛː3.naːm4 rᴐː1 dɤːn1 paj1 haː5 plaː1 <s/>’

>tltk.nlp.th2roman(Text) : returns the Thai romanization according to the Thai Royal Institute guideline, e.g. tltk.nlp.th2roman(“คือเขาเดินเลยลงไปรอในแม่น้ำสะอาดไปหามะปราง”)

=> ‘khue khaw doen loei long pai ro nai maenam sa-at pai ha maprang <s/>’

>tltk.nlp.th2read(Text) : convert text into Thai reading forms, e.g. th2read(‘สามารถเขียนคำอ่านภาษาไทยได้’)

=> ‘สา-มาด-เขียน-คัม-อ่าน-พา-สา-ไท-ด้าย-’

>tltk.nlp.th2ipa_all(Text) : return all transcriptions (IPA) as a list of tuple (syllable_list, transcription). Transcription is based on syllable reading rules. It could be different from th2ipa. e.g. tltk.nlp.th2ipa_all(“รอยกร่าง”)

=> [(‘รอย~กร่าง’, ‘rᴐːj1.ka2.raːŋ2’), (‘รอย~กร่าง’, ‘rᴐːj1.kraːŋ2’), (‘รอ~ยก~ร่าง’, ‘rᴐː1.jok4.raːŋ3’)]

>tltk.nlp.spell_candidates(Word) : list of possible correct words using minimum edit distance, e.g. tltk.nlp.spell_candidates(‘รักษ’)

=> [‘รัก’, ‘ทักษ’, ‘รักษา’, ‘รักษ์’]

>tltk.nlp.spell_variants(Word, InDict=”no|yes”, Karan=”exclude|include”):

This function returns a list of word variants with the same pronunciation as the input Word. The InDict parameter allows the option “yes” to save only words found in the dictionary, while the default option “no” includes all variants regardless of their dictionary status. The Karan parameter allows the option “include” to include words spelled with the karan character, while the default option “exclude” excludes them. For example, tltk.nlp.spell_variants(‘โควิด’).

=> [‘โฆวิธ’, ‘โฆวิต’, ‘โฆวิด’, ‘โฆวิท’, ‘โฆวิช’, ‘โฆวิจ’, ‘โฆวิส’, ‘โฆวิษ’, ‘โฆวิตร’, ‘โฆวิฒ’, ‘โฆวิฏ’, ‘โฆวิซ’, ‘โควิธ’, ‘โควิต’, ‘โควิด’, ‘โควิท’, ‘โควิช’, ‘โควิจ’, ‘โควิส’, ‘โควิษ’, ‘โควิตร’, ‘โควิฒ’, ‘โควิฏ’, ‘โควิซ’]

Other defined functions in the package:

>tltk.nlp.reset_thaidict() : clear dictionary content

>tltk.nlp.read_thaidict(DictFile) : add a new dictionary e.g. tltk.nlp.read_thaidict(‘BEST.dict’)

>tltk.nlp.check_thaidict(Word) : check whether Word exists in the dictionary

tltk.corpus : basic tools for corpus enquiry

>tltk.corpus.compound(w1, w2): Evaluates the similarity between combinations of w1 and w2, specifically w1-w2, w1-w1w2, and w2-w1w2. For instance, invoking tltk.corpus.compound(‘กลัด’,’กลุ้ม’) indicates that ‘กลัดกลุ้ม’ is more similar to ‘กลุ้ม’.

=>[((‘กลุ้ม’, ‘กลัดกลุ้ม’), 0.42245594), ((‘กลัด’, ‘กลัดกลุ้ม’), 0.09066804), ((‘กลัด’, ‘กลุ้ม’), 0.0011619462)]

>tltk.corpus.Corpus_build(DIR, filetype=”xxx”) creates a corpus as a list of paragraphs from files located in the directory specified by DIR. The default file type is .txt. However, it is important to note that the files must be pre-segmented into words, with each word separated by the | character, e.g. w1|w2|w3|w4 ….

>tltk.corpus.Corpus() creates a corpus object that has three methods:

  • x.frequency(Text): This method returns the frequency of a specific Text string in the corpus.

  • x.dispersion(C): This method returns a dispersion plot for a given word list C in the corpus.

  • x.totalword(C): This method returns the total number of words in the corpus that match a given word list C.

Here, C is the result created from Corpus_build.

>C = tltk.corpus.Corpus_build(‘temp/data/’)

>corp = tltk.corpus.Corpus()

>print(corp.frequency(C))

> {‘จังหวัด’: 32, ‘สมุทรสาคร’: 16, ‘เปิด’: 3, ‘ศูนย์’: 13, ‘ควบคุม’: 13, ‘แจ้ง’: 16, …..}

>tltk.corpus.Xwordlist() creates a comparison object that compares two word lists A and B generated from the Corp.frequency() method. The Corp object is created from Corpus().

Four comparison methods are defined in this object:

  • onlyA(): This method returns the list of words that occur only in A.

  • onlyB(): This method returns the list of words that occur only in B.

  • intersect(): This method returns the list of words that occur in both A and B.

  • union(): This method returns the list of words that occur in either A or B (or both).

Here, c1 and c2 are Corpus() objects, Xcomp is an Xwordlist() object, and parsA and parsB are corpora created with Corpus_build(…).

For example, Xcomp.onlyA(c1.frequency(parsA), c2.frequency(parsB)).
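What the four comparisons return can be pictured with ordinary Python sets over the keys of two frequency dictionaries. This is a plain-Python stand-in for illustration, not the Xwordlist implementation; compare_wordlists and the two small dictionaries are hypothetical, with English placeholder words for readability.

```python
def compare_wordlists(freq_a, freq_b):
    """Compare the vocabularies of two {word: frequency} dictionaries."""
    a, b = set(freq_a), set(freq_b)
    return {
        "onlyA": sorted(a - b),      # words occurring only in A
        "onlyB": sorted(b - a),      # words occurring only in B
        "intersect": sorted(a & b),  # words occurring in both
        "union": sorted(a | b),      # words occurring in either
    }


# Placeholder frequency lists standing in for Corpus().frequency() results.
freq_a = {"rice": 4, "water": 2, "fish": 1}
freq_b = {"water": 5, "car": 3}
result = compare_wordlists(freq_a, freq_b)
for name, words in result.items():
    print(name, words)
```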


>tltk.corpus.W2V_train(Corpus) creates a Word2Vec model. The input is a corpus created with Corpus_build.

>tltk.corpus.D2V_train(Corpus) creates a Doc2Vec model. The input is a corpus created with Corpus_build.

>tltk.corpus.TNC_load() loads TNC.3g by default. The file can be in the working directory or the TLTK package directory.

>tltk.corpus.trigram_load(TRIGRAM) loads trigram data from another source saved in tab-delimited format “W1\tW2\tW3\tFreq”, e.g. tltk.corpus.load3gram(‘TNC.3g’). ‘TNC.3g’ can be downloaded separately from the Thai National Corpus Project.
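To check a trigram file before loading it, the tab-delimited layout can be read with a few lines of standard Python. This is a sketch assuming four tab-separated fields per line (three words, then an integer frequency); read_trigrams is a hypothetical helper, and the in-memory sample stands in for a real file such as ‘TNC.3g’.

```python
import io


def read_trigrams(lines):
    """Parse tab-delimited 'W1<TAB>W2<TAB>W3<TAB>Freq' lines into a dict."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        w1, w2, w3, freq = line.split("\t")
        table[(w1, w2, w3)] = int(freq)
    return table


# In-memory stand-in for a trigram file; a real file would be opened with
# open(path, encoding="utf-8").
sample = io.StringIO("a\tb\tc\t12\nb\tc\td\t3\n")
trigrams = read_trigrams(sample)
print(trigrams[("a", "b", "c")])
```

A line with a different number of fields raises ValueError, which makes malformed files easy to spot.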

>tltk.corpus.unigram(w1) returns the normalized frequency (frequency/million) of w1 from the corpus

>tltk.corpus.bigram(w1,w2) return frequency/million of Bigram w1-w2 from the corpus e.g. tltk.corpus.bigram(“หาย”,”ดี”) => 2.331959592765809

>tltk.corpus.trigram(w1,w2,w3) return frequency/million of Trigram w1-w2-w3 from the corpus

>tltk.corpus.collocates(w, stat=”chi2”, direct=”both”, span=2, limit=10, minfq=1) : returns all collocates of w. Options: stat = {freq, mi, chi2}, direct = {left, right, both}, span = {1, 2}. The output is a list of tuples ((w1,w2), stat), e.g. tltk.corpus.collocates(“วิ่ง”,limit=5)

=> [((‘วิ่ง’, ‘แจ้น’), 86633.93952758134), ((‘วิ่ง’, ‘ตื๋อ’), 77175.29122642518), ((‘วิ่ง’, ‘กระหืดกระหอบ’), 48598.79465339733), ((‘วิ่ง’, ‘ปรู๊ด’), 41111.63720974819), ((‘ลู่’, ‘วิ่ง’), 33990.56839021914)]

>tltk.corpus.W2V_load(File) load w2v model created from gensim. If no file is given, file “TNCc5model3.bin” will be loaded.

>tltk.corpus.w2v_load() loads the word2vec file “TNCc5model2.bin” by default. The file can be in the working directory or the TLTK package directory.

>tltk.corpus.w2v_exist(w) check whether w has a vector representation e.g. tltk.corpus.w2v_exist(“อาหาร”) => True

>tltk.corpus.w2v(w) return vector representation of w

>tltk.corpus.similarity(w1,w2) e.g. tltk.corpus.similarity(“อาหาร”,”อาหารว่าง”) => 0.783551877546

>tltk.corpus.similar_words(w, n=10, cutoff=0., score=”n”) e.g. tltk.corpus.similar_words(“อาหาร”,n=5, score=”y”)

=> [(‘อาหารว่าง’, 0.7835519313812256), (‘ของว่าง’, 0.7366500496864319), (‘ของหวาน’, 0.703102707862854), (‘เนื้อสัตว์’, 0.6960341930389404), (‘ผลไม้’, 0.6641997694969177)]

>tltk.corpus.outofgroup([w1,w2,w3,…]) e.g. tltk.corpus.outofgroup([“น้ำ”,”อาหาร”,”ข้าว”,”รถยนต์”,”ผัก”]) => “รถยนต์”

>tltk.corpus.analogy(w1,w2,w3,n=1) e.g. tltk.corpus.analogy(‘พ่อ’,’ผู้ชาย’,’แม่’) => [‘ผู้หญิง’]

>tltk.corpus.w2v_plot([w1,w2,w3,…]) => plot a scatter graph of w1-wn in two dimensions

>tltk.corpus.w2v_compare_color([w1,w2,w3,…]) => visualize the components of vectors w1-wn in color

>tltk.corpus.compound(w1,w2) => check a compound w1w2, whether w1 or w2 is similar to w1w2 e.g. tltk.corpus.compound(‘เล็ก’,’น้อย’) => [((‘เล็ก’, ‘น้อย’), 0.4533272), ((‘น้อย’, ‘เล็กน้อย’), 0.35492077), ((‘เล็ก’, ‘เล็กน้อย’), 0.24106339)]

Notes

  • The word segmentation method used is based on a maximum collocation approach, which is described in the publication “Collocation and Thai Word Segmentation” by W. Aroonmanakun (2002). This publication can be found in the Proceedings of the Fifth Symposium on Natural Language Processing & The Fifth Oriental COCOSDA Workshop, edited by Thanaruk Theeramunkong and Virach Sornlertlamvanich, and published by Sirindhorn International Institute of Technology in Pathumthani. The relevant pages are 68-75. Here is the link to the publication: http://pioneer.chula.ac.th/~awirote/ling/SNLP2002-0051c.pdf

  • To segment Thai texts, you can use either tltk.nlp.word_segment(Text) or tltk.nlp.syl_segment(Text). The syllable segmentation method is based on a trigram model trained on a corpus of 3.1 million syllables. The input text should be a paragraph of Thai text that may contain English text. Spaces in the paragraph should be marked as “<s/>”. Word boundaries are marked by “|”, and syllable boundaries are marked by “~”. Please note that the syllables represented here are written syllables. Some written syllables may be pronounced as two syllables. For example, “สกัด” is segmented here as one written syllable, but it is pronounced as two syllables “sa1-kat1”.

  • The process of determining words in a sentence is based on a combination of a dictionary and the maximum collocation strength between syllables. The standard dictionary includes many compounds and idioms, such as ‘เตาไมโครเวฟ’, ‘ไฟฟ้ากระแสสลับ’, ‘ปีงบประมาณ’, ‘อุโมงค์ใต้ดิน’, ‘อาหารจานด่วน’, ‘ปูนขาวผสมพิเศษ’, ‘เต้นแร้งเต้นกา’, etc. These will likely be segmented as one word. If your application requires the use of shortest meaningful words (i.e. ‘รถ|โดยสาร’, ‘คน|ใช้’, ‘กลาง|คืน’, ‘ต้น|ไม้’, as segmented in the BEST corpus), you can reset the default dictionary used in this package and load a new dictionary containing only simple words or the shortest meaningful words. To clear the default dictionary content, use “reset_thaidict()”. To load a new dictionary, use “read_thaidict(‘DICT_FILE’)”. A file named ‘BEST.dict’ containing a list of words compiled from the BEST corpus is included in this package.

  • The standard dictionary used in this package has more than 65,000 entries, including abbreviations and transliterations, compiled from various sources. Additionally, a list of 8,700 proper names such as country names, organization names, location names, animal names, plant names, food names, etc., has been added to the system’s dictionary. Examples of such proper names include ‘อุซเบกิสถาน’, ‘สำนักเลขาธิการนายกรัฐมนตรี’, ‘วัดใหญ่สุวรรณาราม’, ‘หนอนเจาะลำต้นข้าวโพด’, and ‘ปลาหมึกกระเทียมพริกไทย’.

  • For segmenting a specific domain text, a specialized dictionary can be used by adding it to the existing dictionary before segmenting the text. This can be done by calling read_thaidict(“SPECIALIZED_DICT”). Please note that the dictionary should be a text file in “utf-8” encoding, and each word should be on a separate line.

  • ‘Sentence segmentation’ or actually ‘EDU segmentation’ is a process of breaking a paragraph into chunks of discourse units, which are usually clauses. It is based on a RandomForestClassifier model, which is trained on an EDU-segmented corpus (8,100 EDUs) created and used in Nalinee’s thesis (http://www.arts.chula.ac.th/~ling/thesis/2556MA-LING-Nalinee.pdf). The model has an accuracy of 97.8%. The reason behind using EDUs can be found in [Aroonmanakun, W. 2007. Thoughts on Word and Sentence Segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, Dec 13-15, 2007, Pattaya, Thailand. 85-90.] [Intasaw, N. and Aroonmanakun, W. 2013. Basic Principles for Segmenting Thai EDUs. in Proceedings of 27th Pacific Asia Conference on Language, Information, and Computation, pages 491-498, Nov 22-24, 2013, Taipei.].

  • ‘grapheme to phoneme’ (g2p), as well as IPA transcription (th2ipa) and Thai romanization (th2roman) are based on the hybrid approach presented in the paper “A Unified Model of Thai Word Segmentation and Romanization”. The Thai Royal Institute guideline for Thai romanization can be downloaded from “http://www.arts.chula.ac.th/~ling/tts/ThaiRoman.pdf”, or “http://www.royin.go.th/?page_id=619”. [Aroonmanakun, W., and W. Rivepiboon. 2004. A Unified Model of Thai Word Segmentation and Romanization. In Proceedings of The 18th Pacific Asia Conference on Language, Information and Computation, Dec 8-10, 2004, Tokyo, Japan. 205-214.] (http://www.aclweb.org/anthology/Y04-1021)
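The note above on specialized dictionaries asks for a UTF-8 text file with one word per line. A minimal sketch of preparing such a file follows; write_dict_file and the temporary path are hypothetical, and the read_thaidict call is left as a comment so that the example runs without TLTK installed.

```python
import os
import tempfile


def write_dict_file(words, path):
    """Write one word per line in UTF-8, skipping blanks and duplicates."""
    kept = []
    for w in words:
        w = w.strip()
        if w and w not in kept:
            kept.append(w)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(kept) + "\n")
    return kept


path = os.path.join(tempfile.mkdtemp(), "SPECIALIZED_DICT")
written = write_dict_file(["ต้นไม้", "กลางคืน", "ต้นไม้", " "], path)
with open(path, encoding="utf-8") as f:
    print(f.read().splitlines() == written)

# Per the note above, the file would then be added to the dictionary with:
# tltk.nlp.read_thaidict(path)
```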

Remarks

tag            precision   recall   f1-score   support
B-L            0.56        0.48     0.52       27105
B-O            0.72        0.58     0.64       59613
B-P            0.82        0.83     0.83       83358
I-L            0.52        0.43     0.47       17859
I-O            0.67        0.59     0.63       67396
I-P            0.85        0.88     0.86       175069
O              0.92        0.94     0.93       1032377
accuracy                            0.88       1462777
macro avg      0.72        0.68     0.70       1462777
weighted avg   0.87        0.88     0.88       1462777
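The two summary rows can be reproduced from the per-tag rows, which is a useful check on how to read them: the macro average is the plain mean over the seven tags, while the weighted average weights each tag by its support. The sketch below uses the rounded figures reported above, so it only matches to two decimals; the function names are hypothetical.

```python
# (precision, recall, f1-score, support) per tag, as reported above.
rows = {
    "B-L": (0.56, 0.48, 0.52, 27105),
    "B-O": (0.72, 0.58, 0.64, 59613),
    "B-P": (0.82, 0.83, 0.83, 83358),
    "I-L": (0.52, 0.43, 0.47, 17859),
    "I-O": (0.67, 0.59, 0.63, 67396),
    "I-P": (0.85, 0.88, 0.86, 175069),
    "O": (0.92, 0.94, 0.93, 1032377),
}


def macro_average(rows, column):
    """Unweighted mean of one metric column over all tags."""
    return sum(r[column] for r in rows.values()) / len(rows)


def weighted_average(rows, column):
    """Mean of one metric column weighted by each tag's support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[column] * r[3] for r in rows.values()) / total


total_support = sum(r[3] for r in rows.values())
print(total_support)
print([round(macro_average(rows, c), 2) for c in range(3)])
print([round(weighted_average(rows, c), 2) for c in range(3)])
```

The large gap between the macro and weighted figures reflects the dominance of the ‘O’ tag, which accounts for about 70% of the support.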

Use cases

This package is free for commercial use. If you incorporate this package in your work, we would appreciate it if you informed us at awirote@chula.ac.th.
