datrie

Super-fast, efficiently stored Trie for Python

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

datrie

Super-fast, efficiently stored Trie for Python (2.x and 3.x). Uses libdatrie.

Installation

pip install datrie

Usage

Create a new trie capable of storing items with lower-case ascii keys:

>>> import string
>>> import datrie
>>> trie = datrie.new(string.ascii_lowercase)

trie variable is a dict-like object that can have unicode keys of certain ranges and Python objects as values.

In addition to implementing the mapping interface, tries facilitate finding the items for a given prefix, and vice versa, finding the items whose keys are prefixes of a given string. As a common special case, finding the longest-prefix item is also supported.

Add some values to it (datrie keys must be unicode; the examples are for Python 2.x):

>>> trie[u'foo'] = 5
>>> trie[u'foobar'] = 10
>>> trie[u'bar'] = 'bar value'
>>> trie.setdefault(u'foobar', 15)
10

Check if u’foo’ is in trie:

>>> u'foo' in trie
True

Get a value:

>>> trie[u'foo']
5

Find all prefixes of a word:

>>> trie.prefixes(u'foobarbaz')
[u'foo', u'foobar']

>>> trie.prefix_items(u'foobarbaz')
[(u'foo', 5), (u'foobar', 10)]

>>> trie.iter_prefixes(u'foobarbaz')
<generator object ...>

>>> trie.iter_prefix_items(u'foobarbaz')
<generator object ...>

Find the longest prefix of a word:

>>> trie.longest_prefix(u'foo')
u'foo'

>>> trie.longest_prefix(u'foobarbaz')
u'foobar'

>>> trie.longest_prefix(u'gaz')
KeyError: u'gaz'

>>> trie.longest_prefix(u'gaz', default=u'vasia')
u'vasia'

>>> trie.longest_prefix_item(u'foobarbaz')
(u'foobar', 10)

Check if the trie has keys with a given prefix:

>>> trie.has_keys_with_prefix(u'fo')
True

>>> trie.has_keys_with_prefix(u'FO')
False

Get all items with a given prefix from a trie:

>>> trie.keys(u'fo')
[u'foo', u'foobar']

>>> trie.items(u'ba')
[(u'bar', 'bar value')]

>>> trie.values(u'foob')
[10]

Save & load a trie (values must be picklable):

>>> trie.save('my.trie')
>>> trie2 = datrie.Trie.load('my.trie')

Trie and BaseTrie

There are two Trie classes in datrie package: datrie.Trie and datrie.BaseTrie. datrie.BaseTrie is slightly faster and uses less memory but it can store only integer numbers 0 <= x <= 2147483647. datrie.Trie is a bit slower but can store any Python object as a value; it is the trie created by datrie.new by default.

If you don’t need values or integer values are OK then use datrie.BaseTrie:

import datrie
import string
alpha_map = datrie.AlphaMap(alphabet=string.ascii_lowercase)
trie = datrie.BaseTrie(alpha_map=alpha_map)

Performance

Performance is measured for datrie.Trie against Python’s dict with 100k unique unicode words (English and Russian) as keys and ‘1’ numbers as values.

datrie.Trie uses about 5M memory for 100k words; Python’s dict uses about 22M for this according to my unscientific tests.

This trie implementation is 2-6 times slower than python’s dict on __getitem__. Benchmark results (macbook air i5 1.7GHz, “1.000M ops/sec” == “1 000 000 operations per second”):

Python 2.6:
dict __getitem__: 6.024M ops/sec
trie __getitem__: 2.272M ops/sec

Python 2.7:
dict __getitem__: 6.693M ops/sec
trie __getitem__: 2.357M ops/sec

Python 3.2:
dict __getitem__: 3.628M ops/sec
trie __getitem__: 1.980M ops/sec

Python 3.3b1:
dict __getitem__: 6.721M ops/sec
trie __getitem__: 2.584M ops/sec

Looking for prefixes of a given word is almost as fast as __getitem__ (results are for Python 3.2, they are even faster under Python 2.x on my machine):

trie.iter_prefix_items (hits):      0.431M ops/sec
trie.prefix_items (hits):           0.685M ops/sec
trie.prefix_items loop (hits):      0.601M ops/sec
trie.iter_prefixes (hits):          0.814M ops/sec
trie.iter_prefixes (misses):        1.565M ops/sec
trie.iter_prefixes (mixed):         1.461M ops/sec
trie.has_keys_with_prefix (hits):   1.945M ops/sec
trie.has_keys_with_prefix (misses): 2.625M ops/sec
trie.longest_prefix (hits):         1.750M ops/sec
trie.longest_prefix (misses):       1.569M ops/sec
trie.longest_prefix (mixed):        1.662M ops/sec
trie.longest_prefix_item (hits):    1.075M ops/sec
trie.longest_prefix_item (misses):  1.058M ops/sec
trie.longest_prefix_item (mixed):   1.083M ops/sec

Looking for all words starting with a given prefix is mostly limited by overall result count (this can be improved in future because a lot of time is spent decoding strings from utf_32_le to Python’s unicode):

trie.items(prefix="xxx"), avg_len(res)==415:        0.690K ops/sec
trie.keys(prefix="xxx"), avg_len(res)==415:         0.721K ops/sec
trie.values(prefix="xxx"), avg_len(res)==415:       2.151K ops/sec
trie.items(prefix="xxxxx"), avg_len(res)==17:       15.841K ops/sec
trie.keys(prefix="xxxxx"), avg_len(res)==17:        16.829K ops/sec
trie.values(prefix="xxxxx"), avg_len(res)==17:      43.930K ops/sec
trie.items(prefix="xxxxxxxx"), avg_len(res)==3:     71.620K ops/sec
trie.keys(prefix="xxxxxxxx"), avg_len(res)==3:      77.067K ops/sec
trie.values(prefix="xxxxxxxx"), avg_len(res)==3:    157.464K ops/sec
trie.items(prefix="xxxxx..xx"), avg_len(res)==1.4:  116.869K ops/sec
trie.keys(prefix="xxxxx..xx"), avg_len(res)==1.4:   128.392K ops/sec
trie.values(prefix="xxxxx..xx"), avg_len(res)==1.4: 194.388K ops/sec
trie.items(prefix="xxx"), NON_EXISTING:             1753.472K ops/sec
trie.keys(prefix="xxx"), NON_EXISTING:              1797.559K ops/sec
trie.values(prefix="xxx"), NON_EXISTING:            1705.695K ops/sec

Build time is worse than dict’s; updates are quite fast:

dict __setitem__ (updates): 3.489M ops/sec
trie __setitem__ (updates): 1.862M ops/sec
dict __setitem__ (inserts): 3.628M ops/sec
trie __setitem__ (inserts): 0.050M ops/sec
dict setdefault (updates):  2.575M ops/sec
trie setdefault (updates):  1.600M ops/sec
dict setdefault (inserts):  2.596M ops/sec
trie setdefault (inserts):  0.050M ops/sec

Please take this benchmark results with a grain of salt; this is a very simple benchmark and may not cover your use case.

Current Limitations

keys must be unicode (no implicit conversion for byte strings under Python 2.x, sorry);
there are no iterator versions of keys/values/items (this is a current limitation of libdatrie);
it doesn’t work under pypy+MacOS X (some obscure error which I don’t understand);
library is not tested with narrow Python builds.

Contributing

Development happens at github and bitbucket:

The main issue tracker is at github.

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

Running tests and benchmarks

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.

$ tox -c tox-bench.ini

runs benchmarks.

If you’ve changed anything in the source code then make sure cython is installed and run

$ update_c.sh

before each tox command.

Please note that benchmarks are not included in the release tar.gz’s because benchmark data is large and this saves a lot of bandwidth; use source checkouts from github or bitbucket for the benchmarks.

Authors & Contributors

Mikhail Korobov <kmike84@gmail.com>

This module is based on libdatrie C library and is inspired by fast_trie Ruby bindings, PyTrie pure Python implementation and Tree::Trie Perl implementation; some docs and API ideas are borrowed from these projects.

License

Licensed under LGPL v2.1.

CHANGES

0.2 (2012-07-16)

datrie.Trie items can have any Python object as a value (Trie from 0.1.x becomes datrie.BaseTrie);
longest_prefix and longest_prefix_items are fixed;
save & load are rewritten;
setdefault method.

0.1.1 (2012-07-13)

Windows support (upstream libdatrie changes are merged);
license is changed from LGPL v3 to LGPL v2.1 to match the libdatrie license.

0.1 (2012-07-12)

Initial release.

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

0.8.2

Mar 26, 2020

0.8

Jul 3, 2019

0.7.1

Mar 11, 2016

0.7

Feb 17, 2014

0.6.1

Sep 20, 2013

0.6

Jul 9, 2013

0.5.1

Jan 29, 2013

0.5

Jan 29, 2013

0.4.2

Sep 2, 2012

0.4.1

Jul 29, 2012

0.4

Jul 27, 2012

0.3

Jul 21, 2012

This version

0.2

Jul 16, 2012

0.1.1

Jul 13, 2012

0.1

Jul 11, 2012

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datrie-0.2.tar.gz (127.9 kB view hashes)

Uploaded Jul 16, 2012 Source

Hashes for datrie-0.2.tar.gz

Hashes for datrie-0.2.tar.gz
Algorithm	Hash digest
SHA256	`e908cf61751355c847a666fd85cc77022e2b24d35b3ea8c67f23b2d6e05ed232`
MD5	`9d639ac1b16e9b066fc2c1a153f7015a`
BLAKE2b-256	`6b3cfbbb87fb62bb0b4ab737eb6c58c24764f42eb5a09771dc3e0dafba95aa44`

datrie 0.2

Navigation

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Project description

datrie

Installation

Usage

Trie and BaseTrie

Performance

Current Limitations

Contributing

Running tests and benchmarks

Authors & Contributors

License

CHANGES

0.2 (2012-07-16)

0.1.1 (2012-07-13)

0.1 (2012-07-12)

Project details

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution