skip to navigation
skip to content

Not Logged In

marisa-trie 0.3.6

Static memory-efficient & fast Trie structures for Python (based on marisa-trie C++ library)

Latest Version: 0.6

marisa-trie

Static memory-efficient Trie structures for Python (2.x and 3.x).

String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python dict; the raw lookup speed is comparable; trie also provides fast advanced methods like prefix search.

Based on marisa-trie C++ library.

Note

There are official SWIG-based Python bindings included in C++ library distribution; this package provides an alternative Cython-based pip-installable Python bindings.

Installation

pip install marisa-trie

Usage

There are several Trie classes in this package:

  • marisa_trie.Trie - read-only trie-based data structure that maps unicode keys to auto-generated unique IDs;
  • marisa_trie.RecordTrie - read-only trie-based data structure that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed using python struct module).
  • marisa_trie.BytesTrie - read-only Trie that maps unicode keys to lists of bytes objects.

marisa_trie.Trie

Create a new trie:

>>> import marisa_trie
>>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])

Check if key is in trie:

>>> u'key1' in trie
True
>>> u'key20' in trie
False

Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):

>>> trie.key_id(u'key2')
1

Key can be reconstructed from the ID:

>>> trie.restore_key(1)
u'key2'

Find all prefixes of a given key:

>>> trie.prefixes(u'key12')
[u'key1', u'key12']

There is also a generator version of .prefixes method called .iter_prefixes.

Find all keys from this trie that starts with a given prefix:

>> trie.keys(u'key1')
[u'key1', u'key12']

(iterator version .iterkeys(prefix) is also available).

marisa_trie.RecordTrie

Create a new trie:

>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [(1, 2), (2, 1), (3, 3), (2, 1)]
>>> fmt = "<HH"   # a tuple with 2 short integers
>>> trie = marisa_trie.RecordTrie(fmt, zip(keys, values))

Trie initial data must be an iterable of tuples (unicode_key, data_tuple). Data tuples will be converted to bytes with struct.pack(fmt, *data_tuple).

Take a look at http://docs.python.org/library/struct.html#format-strings for the format string specification.

Duplicate keys are allowed.

Check if key is in trie:

>>> u'foo' in trie
True
>>> u'spam' in trie
False

Get a values list:

>>> trie[u'bar']
[(2, 1)]
>>> trie[u'foo']
[(1, 2), (2, 1)]
>>> trie.get(u'bar', 123)
[(2, 1)]
>>> trie.get(u'BAAR', 123) # default value
123

Find all prefixes of a given key:

>>> trie.prefixes(u'foobarz')
[u'foo', u'foobar']

Find all keys from this trie that starts with a given prefix:

>> trie.keys(u'fo')
[u'foo', u'foo', u'foobar']

Find all items from this trie that starts with a given prefix:

>> trie.items(u'fo')
[(u'foo', (1, 2)), (u'foo', (2, 1), (u'foobar', (3, 3))]

Note

Iterator version of .keys() and .items() are not implemented yet.

marisa_trie.BytesTrie

BytesTrie is similar to RecordTrie, but the values are raw bytes, not tuples:

>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [b'foo-value', b'bar-value', b'foobar-value', b'foo-value2']
>>> trie = marisa_trie.BytesTrie(zip(keys, values))
>>> trie[u'bar']
[b'bar-value']

Persistence

Trie objects supports saving/loading, pickling/unpickling and memory mapped I/O.

Write trie to a stream:

>>> with open('my_trie.marisa', 'w') as f:
...     trie.write(f)

Save trie to a file:

>>> trie.save('my_trie_copy.marisa')

Read trie from stream:

>>> trie2 = marisa_trie.Trie()
>>> with open('my_trie.marisa', 'r') as f:
...     trie.read(f)

Load trie from file:

>>> trie2.load('my_trie.marisa')

Trie objects are picklable:

>>> import pickle
>>> data = pickle.dumps(trie)
>>> trie3 = pickle.loads(data)

You may also build a trie using marisa-build command-line utility (provided by underlying C++ library; it should be downloaded and compiled separately) and then load the trie from the resulting file using .load() method.

Memory mapped I/O

It is possible to use memory mapped file as data source:

>>> trie = marisa_trie.RecordTrie(fmt).mmap('my_record_trie.marisa')

This way the whole dictionary won't be loaded to memory; memory mapped I/O is an easy way to share dictionary data among processes.

Warning

Memory mapped trie might cause a lot of random disk accesses which considerably increase the search time.

Trie storage options

marisa-trie C++ library provides some configuration options for trie storage; check http://marisa-trie.googlecode.com/svn/trunk/docs/readme.en.html page (scroll down to "Enumeration Constants" section) to get an idea.

These options are exposed as order, num_tries, cache_size and binary keyword arguments for trie constructors.

For example, set order to marisa_trie.LABEL_ORDER in order to make trie functions return results in alphabetical oder:

>>> trie = marisa_trie.RecordTrie(fmt, data, order=marisa_trie.LABEL_ORDER)

Benchmarks

My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):

  • dict(unicode words -> word lenghts): about 600M
  • list(unicode words) : about 300M
  • BaseTrie from datrie library: about 70M
  • marisa_trie.RecordTrie : 11M
  • marisa_trie.Trie: 7M

Note

Lengths of words were stored as values in datrie.BaseTrie and marisa_trie.RecordTrie. RecordTrie compresses similar values and the key compression is better so it uses much less memory than datrie.BaseTrie.

marisa_trie.Trie provides auto-assigned IDs. It is not possible to store arbitrary values in marisa_trie.Trie so it uses less memory than RecordTrie.

Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):

dict __getitem__ (hits):            4.090M ops/sec
Trie __getitem__ (hits):            not supported
BytesTrie __getitem__ (hits):       0.469M ops/sec
RecordTrie __getitem__ (hits):      0.373M ops/sec

dict get() (hits):                  2.792M ops/sec
Trie get() (hits):                  not supported
BytesTrie get() (hits):             0.434M ops/sec
RecordTrie get() (hits):            0.369M ops/sec
dict get() (misses):                2.867M ops/sec
Trie get() (misses):                not supported
BytesTrie get() (misses):           0.817M ops/sec
RecordTrie get() (misses):          0.824M ops/sec

dict __contains__ (hits):           4.036M ops/sec
Trie __contains__ (hits):           0.910M ops/sec
BytesTrie __contains__ (hits):      0.573M ops/sec
RecordTrie __contains__ (hits):     0.591M ops/sec
dict __contains__ (misses):         3.346M ops/sec
Trie __contains__ (misses):         1.643M ops/sec
BytesTrie __contains__ (misses):    0.976M ops/sec
RecordTrie __contains__ (misses):   1.017M ops/sec

dict items():                       58.316 ops/sec
Trie items():                       not supported
BytesTrie items():                  11.914 ops/sec
RecordTrie items():                 8.668 ops/sec

dict keys():                        211.194 ops/sec
Trie keys():                        19.198 ops/sec
BytesTrie keys():                   15.399 ops/sec
RecordTrie keys():                  15.277 ops/sec

Trie.prefixes (hits):               0.525M ops/sec
Trie.prefixes (mixed):              1.522M ops/sec
Trie.prefixes (misses):             1.191M ops/sec
RecordTrie.prefixes (hits):         0.106M ops/sec
RecordTrie.prefixes (mixed):        0.451M ops/sec
RecordTrie.prefixes (misses):       0.173M ops/sec
Trie.iter_prefixes (hits):          0.536M ops/sec
Trie.iter_prefixes (mixed):         1.248M ops/sec
Trie.iter_prefixes (misses):        1.001M ops/sec

Trie.keys(prefix="xxx"), avg_len(res)==415:         5.087K ops/sec
Trie.keys(prefix="xxxxx"), avg_len(res)==17:        86.911K ops/sec
Trie.keys(prefix="xxxxxxxx"), avg_len(res)==3:      258.711K ops/sec
Trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4:   280.668K ops/sec
Trie.keys(prefix="xxx"), NON_EXISTING:              1076.751K ops/sec

RecordTrie.keys(prefix="xxx"), avg_len(res)==415:       3.754K ops/sec
RecordTrie.keys(prefix="xxxxx"), avg_len(res)==17:      73.784K ops/sec
RecordTrie.keys(prefix="xxxxxxxx"), avg_len(res)==3:    234.210K ops/sec
RecordTrie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 274.754K ops/sec
RecordTrie.keys(prefix="xxx"), NON_EXISTING:            1061.222K ops/sec

Tries from marisa_trie uses less memory, tries from datrie are faster.

Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.

Current limitations

  • The library is known not to install on Windows with mingw32 compiler;
  • .prefixes() method of BytesTrie and RecordTrie is quite slow;
  • read() and write() methods don't work with file-like objects (they work only with real files; pickling works fine for file-like objects);
  • iterator versions of methods are not always implemented;
  • there are keys() and items() methods but no values() method.

Contributions are welcome!

Contributing

Development happens at github and bitbucket:

The main issue tracker is at github: https://github.com/kmike/marisa-trie/issues

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

If you found a bug in a C++ part please report it to the original bug tracker.

How is source code organized

There are 4 folders in repository:

  • bench - benchmarks & benchmark data;
  • lib - original unmodified marisa-trie C++ library which is bundled for easier distribution; if something is have to be fixed in this library consider fixing it in the original repo ;
  • src - wrapper code; src/marisa_trie.pyx is a wrapper implementation; src/*.pxd files are Cython headers for corresponding C++ headers; src/*.cpp files are the pre-built extension code and shouldn't be modified directly (they should be updated via update_cpp.sh script).
  • tests - the test suite.

Running tests and benchmarks

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under python 2.6, 2.7, 3.2.

In order to run benchmarks, type

$ tox -c bench.ini

Authors & Contributors

This module is based on marisa-trie C++ library by Susumu Yata & contributors.

License

Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is licensed under BSD license.

CHANGES

0.3.6 (2012-09-05)

  • much faster (3x-7x) .items() and .keys() methods for all tries; faster (up to 3x) .prefixes() method for Trie.

0.3.5 (2012-08-30)

  • Pickling of RecordTrie is fixed (thanks lazarou for the report);
  • error messages should become more useful.

0.3.4 (2012-08-29)

  • Issues with mingw32 should be resolved (thanks Susumu Yata).

0.3.3 (2012-08-27)

  • .get(key, default=None) method for BytesTrie and RecordTrie;
  • small README improvements.

0.3.2 (2012-08-26)

  • Small code cleanup;
  • load, read and mmap methods returns 'self';
  • I can't run tests (via tox) under Python 3.3 so it is removed from supported versions for now.

0.3.1 (2012-08-23)

  • .prefixes() support for RecordTrie and BytesTrie.

0.3 (2012-08-23)

  • RecordTrie and BytesTrie are introduced;
  • IntTrie class is removed (probably temporary?);
  • dumps/loads methods are renamed to tobytes/frombytes;
  • benchmark & tests improvements;
  • support for MARISA-trie config options is added.

0.2 (2012-08-19)

  • Pickling/unpickling support;
  • dumps/loads methods;
  • python 3.3 workaround;
  • improved tests;
  • benchmarks.

0.1 (2012-08-17)

Initial release.

 
File Type Py Version Uploaded on Size
marisa-trie-0.3.6.tar.gz (md5) Source 2012-09-05 163KB
  • Downloads (All Versions):
  • 44 downloads in the last day
  • 533 downloads in the last week
  • 1910 downloads in the last month