oice.langdet 1.0dev-r781

Automatic Language Detector

Language Detector

This is a simple (yet powerful) automatic language detector. Currently the only languages it can detect are:

  • English
  • Spanish
  • French

Installation and Usage

To install just run the easy_install tool:

easy_install oice.langdet

This installs a console script named langdet. Run langdet with the name of a plain text file as its first parameter. Example:

langdet simple.txt

This prints the two-letter ISO 639-1 code of the detected language.

You may also use oice.langdet in Python scripts like this:

#!/usr/bin/env python2.5
# -*- coding: utf-8 -*-
from StringIO import StringIO

from oice.langdet import langdet
from oice.langdet import streams
from oice.langdet import languages

text = streams.Stream(StringIO(u"Must be a Python Unicode text"))
lang = langdet.LanguageDetector.detect(text)
if lang == languages.spanish:
    print u'Texto en español'
elif lang == languages.english:
    print u'English text'
else:
    print u'French text'

Caveats

Currently there are some restrictions:

  • langdet does not work properly with standard input or pipelines.

  • You cannot use a file-like object directly with LanguageDetector, i.e., you must use the Stream wrapper.

    This is because we try to guess the text encoding and normalize the text to a Python Unicode string. However, we plan to remove this normalization step and count the frequency of octets and pairs of octets instead.

  • If the piece of text is not written in any of the languages we can detect, the best match (see How it works) is selected.
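
The octet-counting idea mentioned in the second caveat could look roughly like the sketch below. It is not part of oice.langdet: the function name octet_footprint is hypothetical, and the code is written for a current Python 3 interpreter rather than the Python 2.5 used in the example above.

```python
from collections import Counter


def octet_footprint(data):
    """Count single octets and pairs of adjacent octets in a bytes object."""
    counts = Counter()
    # Iterating over bytes yields integers in the range 0..255.
    counts.update((octet,) for octet in data)
    # Adjacent octets, stored as 2-tuples of integers.
    counts.update(zip(data, data[1:]))
    return counts


# The same character produces different octets in different encodings,
# so the resulting counts depend on the encoding of the input.
utf8_counts = octet_footprint(u"espa\u00f1ol".encode("utf-8"))
latin1_counts = octet_footprint(u"espa\u00f1ol".encode("latin-1"))
print(utf8_counts == latin1_counts)
```

No decoding step is needed with this approach, but a text and the reference material it is compared against would only produce comparable counts if they use the same encoding.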

Work in progress

In a sentence: we are trying to solve the first two caveats, and are considering support for Python 2.6 and Python 3.0.

How it works

Language detection is based on statistics of the frequency of letters and pairs of letters in the input text.

Each module in the package oice.langdet.languages contains a "footprint" of text in one of the supported languages.

The texts used in the generation of the footprints were:

  • El ingenioso hidalgo Don Quijote de la Mancha
  • The Holy Bible
  • La Folle Journée, ou Le Mariage de Figaro

When detecting the language of a piece of text, we first count the frequencies of letters and pairs of letters in the text, and then compare the results with the footprint of each language. The best match is selected.

We use simple cosine similarity to compare the counts obtained from the text with each footprint.
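
The steps above can be sketched in a few lines of Python. This is an illustration of the general technique only, not the code shipped in oice.langdet: the names footprint, cosine_similarity and best_match are hypothetical, the one-line reference texts are toy stand-ins for the full books listed above, and the code targets a current Python 3 interpreter.

```python
import math
from collections import Counter


def footprint(text):
    """Count letters and pairs of adjacent letters within each word."""
    counts = Counter()
    for word in text.lower().split():
        # Assumption: punctuation and digits are simply dropped.
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(letters)
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(text, footprints):
    """Return the key of the footprint most similar to the text."""
    sample = footprint(text)
    return max(footprints,
               key=lambda lang: cosine_similarity(sample, footprints[lang]))


# Toy reference footprints built from a single sentence each.
references = {
    "en": footprint("in the beginning god created the heaven and the earth"),
    "es": footprint("en un lugar de la mancha de cuyo nombre no quiero acordarme"),
}
print(best_match("de cuyo nombre no quiero acordarme", references))
```

Because best_match always returns the highest-scoring footprint, it picks one of the known languages even for text in a language it has no footprint for, which matches the third caveat above.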

Accuracy of the detection

To test the accuracy of this implementation we downloaded the full European Parliament Proceedings Parallel Corpus 1996-2006 and ran the langdet script on the sets of English, Spanish and French documents.

For each language we counted the number of times langdet returned the correct ISO 639-1 code. For example, to count the documents detected as Spanish:

find . -type f -exec langdet {} \; | grep -c es
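
Turning such counts into percentages can be sketched as follows. This helper is hypothetical and not part of oice.langdet; it assumes the output of langdet has been collected as one code per document, and it treats anything other than the three supported codes as an error, which may differ from how the errors in this test were actually identified.

```python
from collections import Counter


def detection_percentages(codes, expected=("en", "es", "fr")):
    """Percentage of documents reported under each expected code.

    Anything that is not one of the expected codes is counted as an error
    (an assumption of this sketch).
    """
    codes = [code.strip() for code in codes]
    total = len(codes)
    if total == 0:
        return {}
    tally = Counter(codes)
    row = {code: 100.0 * tally[code] / total for code in expected}
    unexpected = sum(n for code, n in tally.items() if code not in expected)
    row["errors"] = 100.0 * unexpected / total
    return row


# Four documents: two reported as English, one as Spanish, one unreadable.
print(detection_percentages(["en\n", "en\n", "es\n", ""]))
# → {'en': 50.0, 'es': 25.0, 'fr': 0.0, 'errors': 25.0}
```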

The results are summarized in the following table:

Summary of accuracy test for langdet

                 Detected as
Real language    English    Spanish    French     Errors [1]
English          98.78%     0%         0%         1.22%
Spanish          0%         100%       0%         0%
French           0%         0%         100%       0%
Danish           1.22%      16.08%     82.7%      0%
German           1.97%      0.15%      97.88%     0%
Finnish          0.65%      5.9%       93.45%     0%
Italian          0%         99.54%     0.46%      0%

[1] Errors are generally produced when the detector cannot guess the encoding of the input text.

    In Caveats we propose a solution for this; however, its impact on the accuracy of detection is not yet clear.

The results show that for documents in the languages that langdet can detect, langdet behaves almost perfectly.

However, the results for documents in other languages show how misleading langdet can be in such cases. We ran those tests for illustration purposes only.

Nevertheless, these results also show that it would be very difficult for this simple algorithm to distinguish Spanish from Italian, and French from German.

 
File Type Py Version Uploaded on Size # downloads
oice.langdet-1.0dev-r781.tar.gz (md5) Source 2008-12-06 26KB 906