oice.langdet 1.0dev-r781
Automatic Language Detector
Language Detector
This is a simple (yet powerful) automatic language detector. Currently, the only languages it can detect are:
- English
- Spanish
- French
Installation and Usage
To install just run the easy_install tool:
easy_install oice.langdet
This will install a console script, langdet. Run langdet with a plain text filename as its first parameter. Example:
langdet simple.txt
This prints the two-letter ISO 639-1 code of the detected language.
You may also use oice.langdet in Python scripts like this:
#!/usr/bin/env python2.5
# -*- coding: utf-8 -*-
from StringIO import StringIO
from oice.langdet import langdet
from oice.langdet import streams
from oice.langdet import languages
# The detector needs a Stream wrapper around a file-like object.
text = streams.Stream(StringIO(u"Must be a Python Unicode text"))
lang = langdet.LanguageDetector.detect(text)
if lang == languages.spanish:
    print u'Texto en español'
elif lang == languages.english:
    print u'English text'
else:
    print u'France' # I don't speak/write French
Caveats
Currently there are some restrictions:
- langdet does not work properly with standard input or pipelines.
- You cannot use a file-like object directly with LanguageDetector, i.e., you must use the Stream wrapper. This is because we try to guess the text encoding and normalize the text to a Python Unicode string. However, we plan to remove this normalization step and count the frequencies of octets and pairs of octets instead.
- If the piece of text is not written in any of the languages we can detect, the best match (see How it works) is still selected.
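The octet-based counting planned in the caveats above could look roughly like the following. This is only a sketch of the idea, not code from the package: the function name `octet_footprint` is hypothetical, and it is written for Python 3 rather than the Python 2.5 that langdet targets.

```python
from collections import Counter


def octet_footprint(data):
    """Count single octets and adjacent octet pairs in a bytes object.

    Hypothetical helper: no decoding is attempted, so the text
    encoding never has to be guessed.
    """
    counts = Counter()
    counts.update(data[i:i + 1] for i in range(len(data)))      # single octets
    counts.update(data[i:i + 2] for i in range(len(data) - 1))  # octet pairs
    return counts


# The same text gives different counts under different encodings,
# because a letter such as "ñ" is one octet in Latin-1 and two in UTF-8.
utf8_counts = octet_footprint(u"español".encode("utf-8"))
latin1_counts = octet_footprint(u"español".encode("latin-1"))
```

With this approach a file could be read in binary mode and passed in as is. The counts then depend on the encoding of the input, which is one reason the effect on accuracy is hard to predict.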
Work in progress
In a sentence: we are trying to solve the first two caveats, and considering support for Python 2.6 and Python 3.0.
How it works
Language detection is based on statistics on the frequency of letters and pairs of letters in the input text.
The modules in the package oice.langdet.languages each contain a "footprint" of text in one of those languages.
The texts used in the generation of the footprints were:
- El ingenioso hidalgo Don Quijote de la Mancha
- The Holy Bible
- La Folle Journée, ou Le Mariage de Figaro
To detect the language of a piece of text, we first count the frequencies of letters and pairs of letters in the text, then compare the results with the footprint of each language. The best match is selected.
We use simple cosine similarity to compare the text with each footprint.
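The procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the package's actual code: the names `footprint`, `cosine_similarity`, `detect` and `FOOTPRINTS` are hypothetical, pairs are assumed not to cross word boundaries, the toy footprints are built from a single sentence each instead of a whole book, and the code targets Python 3.

```python
import math
from collections import Counter


def footprint(text):
    """Count letters and adjacent letter pairs in `text` (case-insensitive)."""
    # Assumption: non-letters separate words, and pairs stay inside a word.
    words = "".join(ch if ch.isalpha() else " " for ch in text.lower()).split()
    counts = Counter()
    for word in words:
        counts.update(word)                                   # single letters
        counts.update(a + b for a, b in zip(word, word[1:]))  # letter pairs
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect(text, footprints):
    """Return the key of the footprint most similar to `text`."""
    sample = footprint(text)
    return max(footprints,
               key=lambda code: cosine_similarity(sample, footprints[code]))


# Toy footprints for illustration only; real ones need far more text.
FOOTPRINTS = {
    "en": footprint("the quick brown fox jumps over the lazy dog and then sleeps"),
    "es": footprint("el veloz zorro marrón salta sobre el perro perezoso y luego duerme"),
    "fr": footprint("le renard brun rapide saute par dessus le chien paresseux puis dort"),
}
```

Because `detect` always returns the highest-scoring key, it reports one of the known languages even when none of them is a good match.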
Accuracy of the detection
To test the accuracy of this implementation, we downloaded the full European Parliament Proceedings Parallel Corpus 1996-2006 and ran the langdet script on the sets of English, Spanish and French documents.
For each language, we counted how many times langdet returned the correct ISO 639-1 code, like this (here counting the documents detected as Spanish):
find -type f -exec langdet {} \; | grep es | wc -l
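Turning such counts into percentages can be sketched in Python as below. The function `summarize` is a hypothetical helper, not part of the package, and it assumes that any output line that is not one of the three known codes counts as an error; the document does not say what langdet prints when detection fails.

```python
from collections import Counter

KNOWN_CODES = ("en", "es", "fr")


def summarize(lines):
    """Return the percentage of lines per known code, plus an 'errors' share."""
    counts = Counter()
    total = 0
    for line in lines:
        code = line.strip()
        # Assumption: anything other than a known code is an error.
        counts[code if code in KNOWN_CODES else "errors"] += 1
        total += 1
    if total == 0:
        return {}
    return {key: 100.0 * counts[key] / total
            for key in KNOWN_CODES + ("errors",)}
```

The input could be, for example, the lines of a file holding the saved output of the find command above without the grep and wc stages.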
The results are summarized in the following table:
| Real language | Detected as English | Detected as Spanish | Detected as French | Errors [1] |
|---|---|---|---|---|
| English | 98.78% | 0% | 0% | 1.22% |
| Spanish | 0% | 100% | 0% | 0% |
| French | 0% | 0% | 100% | 0% |
| Danish | 1.22% | 16.08% | 82.7% | 0% |
| German | 1.97% | 0.15% | 97.88% | 0% |
| Finnish | 0.65% | 5.9% | 93.45% | 0% |
| Italian | 0% | 99.54% | 0.46% | 0% |

[1] Errors are generally produced when the detector cannot guess the encoding of the input text. In Caveats we propose a solution for this; however, its impact on detection accuracy is not clear.

The results show that, for documents in the languages that langdet can detect, langdet behaves almost perfectly.
However, the results for documents in other languages show how misleading langdet can be in such cases. We ran those tests for illustration purposes only.
Nevertheless, these results also show that it would be very difficult for this simple algorithm to distinguish Spanish from Italian, or French from German.
| File | Type | Py Version | Uploaded on | Size | # downloads |
|---|---|---|---|---|---|
| oice.langdet-1.0dev-r781.tar.gz (md5) | Source | | 2008-12-06 | 26KB | 906 |
- Author: Universidad de las Ciencias Informáticas
- Home Page: http://www.uci.cu/
- License: GPL 3.0
- Package Index Owner: mvaled
