skip to navigation
skip to content

Not Logged In

DAWG 0.4

Fast and memory efficient DAWG for Python

Latest Version: 0.7.2

DAWG

This package provides DAWG-based dictionary-like read-only objects for Python (2.x and 3.x).

String data in a DAWG (Directed Acyclic Word Graph) may take 200x less memory than in a standard Python dict or list and the raw lookup speed is comparable. DAWG may be even faster than built-in dict for some operations. It also provides fast advanced methods like prefix search.

Based on dawgdic C++ library.

Installation

pip install DAWG

Usage

There are several DAWG classes in this package:

  • dawg.DAWG - basic DAWG wrapper; it can store unicode keys and do exact lookups;
  • dawg.CompletionDAWG - dawg.DAWG subclass that supports key completion and prefix lookups (but requires more memory);
  • dawg.BytesDAWG - dawg.CompletionDAWG subclass that maps unicode keys to lists of bytes objects.
  • dawg.RecordDAWG - dawg.BytesDAWG subclass that maps unicode keys to lists of data tuples. All tuples must be of the same format (the data is packed using python struct module).
  • dawg.IntDAWG - dawg.DAWG subclass that maps unicode keys to integer values.

DAWG and CompletionDAWG

DAWG and CompletionDAWG are useful when you need fast & memory efficient simple string storage. These classes does not support assigning values to keys.

DAWG and CompletionDAWG constructors accept an iterable with keys:

>>> import dawg
>>> words = [u'foo', u'bar', u'foobar', u'foö', u'bör']
>>> base_dawg = dawg.DAWG(words)
>>> completion_dawg = dawg.CompletionDAWG(words)

It is then possible to check if the key is in a DAWG:

>>> u'foo' in base_dawg
True
>>> u'baz' in completion_dawg
False

It is possible to find all keys that starts with a given prefix in a CompletionDAWG:

>>> completion_dawg.keys(u'foo')
>>> [u'foo', u'foobar']

and to find all prefixes of a given key:

>>> base_dawg.prefixes(u'foobarz')
[u'foo', u'foobar']

Iterator versions are also available:

>>> for key in completion_dawg.iterkeys(u'foo'):
...     print(key)
foo
foobar
>>> for prefix in base_dawg.iterprefixes(u'foobarz'):
...     print(prefix)
foo
foobar

It is possible to find all keys similar to a given key (using a one-way char translation table):

>>> replaces = dawg.DAWG.compile_replaces({u'o': u'ö'})
>>> base_dawg.similar_keys(u'foo', replaces)
[u'foo', u'foö']
>>> base_dawg.similar_keys(u'foö', replaces)
[u'foö']
>>> base_dawg.similar_keys(u'bor', replaces)
[u'bör']

BytesDAWG

BytesDAWG is a CompletionDAWG subclass that can store binary data for each key.

BytesDAWG constructor accepts an iterable with (unicode_key, bytes_value) tuples:

>>> data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', b'value3')]
>>> bytes_dawg = dawg.BytesDAWG(data)

There can be duplicate keys; all unique values are stored in this case:

>>> bytes_dawg[u'key1']
[b'value1, b'value3']

For unique keys a list with a single value is returned for consistency:

>>> bytes_dawg[u'key2']
[b'value2']

KeyError is raised for missing keys; use get method if you need a default value instead:

>>> bytes_dawg.get(u'foo', None)
None

BytesDAWG support keys, items, iterkeys and iteritems methods (they all accept optional key prefix). There is also support for similar_keys, similar_items and similar_item_values methods.

Note

Currently the order of keys returned by BytesDAWG is not the same as the order of keys returned by CompletionDAWG because of the way BytesDAWG is implemented: values are internally stored inside DAWG keys after a separator; separator is a chr(255) byte and thus 'foo' key is greater than 'foobar' key (values compared are 'foo<sep>' and 'foobar<sep>').

RecordDAWG

RecordDAWG is a BytesDAWG subclass that automatically packs & unpacks the binary data from/to Python objects using struct module from the standard library.

First, you have to define a format of the data. Consult Python docs (http://docs.python.org/library/struct.html#format-strings) for the format string specification.

For example, let's store 3 short unsigned numbers (in a Big-Endian byte order) as values:

>>> format = ">HHH"

RecordDAWG constructor accepts an iterable with (unicode_key, value_tuple). Let's create such iterable using zip function:

>>> keys = [u'foo', u'bar', u'foobar', u'foo']
>>> values = [(1, 2, 3), (2, 1, 0), (3, 3, 3), (2, 1, 5)]
>>> data = zip(keys, values)
>>> record_dawg = RecordDAWG(format, data)

As with BytesDAWG, there can be several values for the same key:

>>> record_dawg['foo']
[(1, 2, 3), (2, 1, 5)]
>>> record_dawg['foobar']
[(3, 3, 3)]

IntDAWG

IntDAWG is a {unicode -> int} mapping. It is possible to use RecordDAWG for this, but IntDAWG is natively supported by dawgdic C++ library and so __getitem__ is much faster.

Unlike BytesDAWG and RecordDAWG, IntDAWG doesn't support having several values for the same key.

IntDAWG constructor accepts an iterable with (unicode_key, integer_value) tuples:

>>> data = [ (u'foo', 1), (u'bar', 2) ]
>>> int_dawg = dawg.IntDAWG(data)

It is then possible to get a value from the IntDAWG:

>>> int_dawg[u'foo']
1

Persistence

All DAWGs support saving/loading and pickling/unpickling.

Write DAWG to a stream:

>>> with open('words.dawg', 'wb') as f:
...     d.write(f)

Save DAWG to a file:

>>> d.save('words.dawg')

Load DAWG from a file:

>>> d = dawg.DAWG()
>>> d.load('words.dawg')

Warning

Reading DAWGs from streams and unpickling are currently using 3x memory compared to loading DAWGs using load method; please avoid them until the issue is fixed.

Read DAWG from a stream:

>>> d = dawg.RecordDAWG(format_string)
>>> with open('words.record-dawg', 'rb') as f:
...     d.read(f)

DAWG objects are picklable:

>>> import pickle
>>> data = pickle.dumps(d)
>>> d2 = pickle.loads(data)

Benchmarks

For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):

  • dict(unicode words -> word lenghts): about 600M
  • list(unicode words) : about 300M
  • marisa_trie.RecordTrie : 11M
  • marisa_trie.Trie: 7M
  • dawg.DAWG: 2M
  • dawg.CompletionDAWG: 3M
  • dawg.IntDAWG: 2.7M
  • dawg.RecordDAWG: 4M

Note

Lengths of words were not stored as values in dawg.DAWG, dawg.CompletionDAWG and marisa_trie.Trie because they don't support this.

Benchmark results (100k unicode words, integer values (lenghts of the words), Python 3.2, macbook air i5 1.8 Ghz):

dict __getitem__ (hits):        4.102M ops/sec
DAWG __getitem__ (hits):        not supported
BytesDAWG __getitem__ (hits):   1.558M ops/sec
RecordDAWG __getitem__ (hits):  0.950M ops/sec
IntDAWG __getitem__ (hits):     2.835M ops/sec
dict get() (hits):              3.053M ops/sec
DAWG get() (hits):              not supported
BytesDAWG get() (hits):         1.340M ops/sec
RecordDAWG get() (hits):        0.882M ops/sec
IntDAWG get() (hits):           2.370M ops/sec
dict get() (misses):            3.250M ops/sec
DAWG get() (misses):            not supported
BytesDAWG get() (misses):       2.483M ops/sec
RecordDAWG get() (misses):      2.249M ops/sec
IntDAWG get() (misses):         2.806M ops/sec

dict __contains__ (hits):           4.068M ops/sec
DAWG __contains__ (hits):           3.065M ops/sec
BytesDAWG __contains__ (hits):      2.627M ops/sec
RecordDAWG __contains__ (hits):     2.613M ops/sec
IntDAWG __contains__ (hits):        3.021M ops/sec

dict __contains__ (misses):         3.471M ops/sec
DAWG __contains__ (misses):         3.537M ops/sec
BytesDAWG __contains__ (misses):    3.381M ops/sec
RecordDAWG __contains__ (misses):   3.361M ops/sec
IntDAWG __contains__ (misses):      3.540M ops/sec

dict items():       58.754 ops/sec
DAWG items():       not supported
BytesDAWG items():  15.914 ops/sec
RecordDAWG items(): 10.699 ops/sec
IntDAWG items():    not supported

dict keys():        214.499 ops/sec
DAWG keys():        not supported
BytesDAWG keys():   23.929 ops/sec
RecordDAWG keys():  23.726 ops/sec
IntDAWG keys():     not supported

DAWG.prefixes (hits):    0.244M ops/sec
DAWG.prefixes (mixed):   1.414M ops/sec
DAWG.prefixes (misses):  2.156M ops/sec

RecordDAWG.keys(prefix="xxx"), avg_len(res)==415:       6.057K ops/sec
RecordDAWG.keys(prefix="xxxxx"), avg_len(res)==17:      130.680K ops/sec
RecordDAWG.keys(prefix="xxxxxxxx"), avg_len(res)==3:    507.355K ops/sec
RecordDAWG.keys(prefix="xxxxx..xx"), avg_len(res)==1.4: 745.566K ops/sec
RecordDAWG.keys(prefix="xxx"), NON_EXISTING:            3032.758K ops/sec

Please take this benchmark results with a grain of salt; this is a very simple benchmark on a single data set.

Current limitations

  • IntDAWG is currently a subclass of DAWG and so it doesn't support keys() and items() methods;
  • read() method reads the whole stream (DAWG must be the last or the only item in a stream if it is read with read() method) - pickling doesn't have this limitation;
  • DAWGs loaded with read() and unpickled DAWGs uses 3x-4x memory compared to DAWGs loaded with load() method;
  • there are keys() and items() methods but no values() method;
  • iterator versions of methods are not always implemented;
  • BytesDAWG and RecordDAWG key order is different from CompletionDAWG key order;
  • BytesDAWG and RecordDAWG has a limitation: values larger than 8KB are unsupported.

Contributions are welcome!

Contributing

Development happens at github and bitbucket:

The main issue tracker is at github: https://github.com/kmike/DAWG/issues

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

If you found a bug in a C++ part please report it to the original bug tracker.

How is source code organized

There are 4 folders in repository:

  • bench - benchmarks & benchmark data;
  • lib - original unmodified dawgdic C++ library and a customized version of libb64 library. They are bundled for easier distribution; if something is have to be fixed in these libraries consider fixing it in the original repositories;
  • src - wrapper code; src/dawg.pyx is a wrapper implementation; src/*.pxd files are Cython headers for corresponding C++ headers; src/*.cpp files are the pre-built extension code and shouldn't be modified directly (they should be updated via update_cpp.sh script).
  • tests - the test suite.

Running tests and benchmarks

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.

In order to run benchmarks, type

$ tox -c bench.ini

Authors & Contributors

This module is based on dawgdic C++ library by Susumu Yata & contributors.

base64 decoder is based on libb64 by Chris Venter.

License

Wrapper code is licensed under MIT License. Bundled dawgdic C++ library is licensed under BSD license. libb64 is Public Domain.

0.4 (2012-09-26)

  • iterkeys, iteritems and iterprefixes methods (thanks Dan Blanchard).

0.3.2 (2012-09-24)

  • prefixes method for finding all prefixes of a given key.

0.3.1 (2012-09-20)

  • bundled dawgdic C++ library is updated to the latest version.

0.3 (2012-09-13)

  • similar_keys, similar_items and similar_item_values methods for more permissive lookups (they may be useful e.g. for umlaut handling);
  • load method returns self;
  • Python 3.3 support.

0.2 (2012-09-08)

Greatly improved memory usage for DAWGs loaded with load method.

There is currently a bug somewhere in a wrapper so DAWGs loaded with read() method or unpickled DAWGs uses 3x-4x memory compared to DAWGs loaded with load() method. load() is fixed in this release but other methods are not.

0.1 (2012-09-08)

Initial release.

 
File Type Py Version Uploaded on Size
DAWG-0.4.tar.gz (md5) Source 2012-09-25 206KB
  • Downloads (All Versions):
  • 546 downloads in the last day
  • 2659 downloads in the last week
  • 14931 downloads in the last month