
ruscorpora-tools 0.3

Python interface to a free corpus subset from

This package provides a Python interface to a free corpus subset available at


pip install ruscorpora-tools


Corpus downloading

Download and unpack the archive with XML files from
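
Unpacking can also be scripted. The sketch below assumes the archive has already been downloaded to a local path and that Python 3 is in use; the helper name `unpack_corpus` and both path arguments are hypothetical, not part of the package. It relies on the standard library's `shutil.unpack_archive`, which picks the archive format from the file extension, and then collects the XML files it finds.

```python
import os
import shutil


def unpack_corpus(archive_path, target_dir):
    """Unpack a downloaded corpus archive and return the XML files found.

    Hypothetical helper for illustration, not part of ruscorpora-tools.
    """
    # The archive format is detected from the file extension (.zip, .tar.gz, ...).
    shutil.unpack_archive(archive_path, target_dir)

    # Walk the unpacked tree, since the XML files may sit in subdirectories.
    xml_files = []
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            if name.lower().endswith('.xml'):
                xml_files.append(os.path.join(root, name))
    return sorted(xml_files)
```

The returned paths can then be handed one by one to the parsing function described under "Corpus reading".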

Corpus reading

The ruscorpora.parse_xml function parses a single XML file and returns an iterator over sentences; each sentence is a list of ruscorpora.Token instances, and each token is annotated with a list of ruscorpora.Annotation instances.

ruscorpora.simplify simplifies the result of ruscorpora.parse_xml by removing ambiguous annotations, joining split tokens (along with their annotations) and removing accent information.

>>> import ruscorpora as rnc
>>> for sent in rnc.simplify(rnc.parse_xml('fiction.xml')):
...     print(sent)
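
Since parse_xml handles one file at a time, reading the whole unpacked archive means looping over its files. A minimal sketch follows, assuming the XML files sit directly in one directory. `iter_corpus` and its `corpus_dir` argument are hypothetical names, and the parser is passed in as a callable so the helper itself does not depend on the package.

```python
import glob
import os


def iter_corpus(corpus_dir, parse):
    """Yield sentences from every XML file in corpus_dir, in filename order.

    `parse` is any callable that takes a file path and returns an iterable
    of sentences. Hypothetical helper, not part of ruscorpora-tools.
    """
    pattern = os.path.join(corpus_dir, '*.xml')
    # Sort the paths so the iteration order is reproducible across runs.
    for path in sorted(glob.glob(pattern)):
        for sent in parse(path):
            yield sent
```

With the package installed, it could be called as `iter_corpus('corpus', lambda path: rnc.simplify(rnc.parse_xml(path)))`, where `'corpus'` stands for wherever the archive was unpacked.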

Working with tags

ruscorpora.Tag class is a convenient wrapper for tags used in ruscorpora:

>>> tag = rnc.Tag('S,f,inan=sg,nom')
>>> tag.POS
>>> tag.gender
>>> tag.animacy
>>> tag.number
>>> tag.tense

Other attributes are available as well.

Check whether a grammeme is in a tag:

>>> 'S' in tag
>>> 'V' in tag
>>> 'Foo' in tag
Traceback (most recent call last):
  ...
ValueError: Grammeme is unknown: Foo

Test tag equality:

>>> tag == rnc.Tag('S,f,inan=sg,nom')
>>> tag == 'S,f,inan=sg,nom'
>>> tag == rnc.Tag('S,f,inan=sg,acc')
>>> tag == 'S,f,inan=sg,acc'
>>> tag == 'Foo,inan'
Traceback (most recent call last):
  ...
ValueError: Unknown grammemes: frozenset({Foo})

Tags returned by rnc.simplify are wrapped with this class by default.
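
To make the membership and equality behaviour above concrete, here is a toy model of such a wrapper. It is not the library's implementation: `ToyTag`, its tiny `KNOWN` grammeme inventory, the splitting on `,` and `=`, and the choice to compare tags as unordered grammeme sets are all assumptions made for illustration.

```python
import re


class ToyTag(object):
    """Toy stand-in for a tag wrapper; NOT the real ruscorpora.Tag."""

    # Assumed, deliberately tiny grammeme inventory for this sketch.
    KNOWN = frozenset(['S', 'V', 'f', 'm', 'inan', 'anim',
                       'sg', 'pl', 'nom', 'acc'])

    def __init__(self, text):
        self.text = text
        # Assumption: grammemes are separated by ',' or '='.
        self.grammemes = frozenset(re.split('[,=]', text))
        unknown = self.grammemes - self.KNOWN
        if unknown:
            raise ValueError('Unknown grammemes: %s' % sorted(unknown))

    def __contains__(self, grammeme):
        # Reject typos loudly instead of silently returning False.
        if grammeme not in self.KNOWN:
            raise ValueError('Grammeme is unknown: %s' % grammeme)
        return grammeme in self.grammemes

    def __eq__(self, other):
        # Plain strings are converted first, so unknown grammemes raise here too.
        if isinstance(other, str):
            other = ToyTag(other)
        if not isinstance(other, ToyTag):
            return NotImplemented
        return self.grammemes == other.grammemes

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Keep hashing consistent with the set-based equality above.
        return hash(self.grammemes)
```

Raising on unknown grammemes, as the real examples do, means that a misspelled grammeme fails immediately rather than producing a filter that never matches.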


Development happens at GitHub and Bitbucket:

The issue tracker is at GitHub:

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

Running tests

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under Python 2.6 through 3.3 and PyPy > 1.8.

File                               Type    Py Version  Uploaded on  Size
ruscorpora-tools-0.3.tar.gz (md5)  Source              2013-02-14   8KB