skip to navigation
skip to content

NoAho 0.9

Fast, non-overlapping simultaneous multiple keyword search

Latest Version:

Non-Overlapping Aho-Corasick Trie

- 'short' and 'long' (longest key) searches, both one-off and iteration
over all in some text (non-overlapping). (Short probably isn't useful
without overlap.)
- Works with both unicode and str in Python 2, and unicode in Python 3
(it's all UCS4 under the hood).
- Allows you to associate an arbitrary Python object payload with each
keyword, and supports dict operations len, [], and 'in' for the
keywords (though no del or traversal).
- Does the 'compilation' (generation of Aho-Corasick failure links) of
the trie on-demand; you can mix adding keywords and searching text
- Can be used commercially, it has a minimal license.

- Does not support overlapping keywords, unless you move along the
string manually, one character at a time, which would defeat the
- find[all]_short is kind of useless.
- Lacks key iteration and deletion from the mapping (dict) protocol
- memory leaking untested (should be ok but ...)
- No /testcase/ for unicode in Python 2 (did manual test however)
- Unicode chars represented as ucs4, and, each character has its own
hashtable, so it's relatively memory-heavy.
- Requires a C++ compiler.

Bug reports and patches welcome of course!

To build and install, use either
# Python 2
python install # (or build, and copy the .so to where you want it)
# Python 3
python3 install # (or build, and copy the .so to where you want it)

'text' below applies to str and unicode in Python 2, or unicode in Python 3 (all there is)
t.add(key_text, optional payload)
(start, end, key_payload) = t.find_short(text_to_search)
(start, end, key_payload) = t.find_long(text_to_search)
(start, end, key_payload) = t.findall_short(text_to_search)
(start, end, key_payload) = t.findall_long(text_to_search)

>>> a = NoAho()
>>> a.add('ms windows')
>>> a.add('ms windows 2000', "this is canonical")
>>> a.add('windows', None)
>>> a.add('windows 2000')
>>> text = 'windows 2000 ms windows 2000 windows'
>>> for k in a.findall_short(text):
... print text[k[0]:k[1]]
ms windows
>>> for k in a.findall_long(text):
... print text[k[0]:k[1]]
windows 2000
ms windows 2000

Mapping (dictionary) methods:
trie = NoAho()
trie['apple'] = apple_handling_function
trie['orange'] = Orange()
trie.add('banana') # payload will be None
trie['pear'] # will give key error
assert isinstance(trie['orange'], Orange)
assert 'banana' in trie
# No del;
# no iteration over keys

The 'find[all]_short' forms are named as long and awkwardly as they are,
to leave plain 'find[all]' free if overlapping matches are ever implemented.

Deep Rebuilding:
Written in C++ and Cython.

You should not need to use Cython to build and use this module, but if
you want to change it and regenerate, there's a build command line at
the top of noaho.pyx. To allow the generated noaho.cpp file to be
used in both Py2 and Py3, use a Python 3-built Cython. This was built
with Cython-0.15.1, but even as far back as 0.13 should work (for that
version though, you'd have to manually use a c++ compiler to do the
last step, of linking).

For the fullest spec of what the code will and will not so, check out (python[3]

Untested: whether the payload handling is complete, ie that there are no
memory leaks. It should be correct though.  
File Type Py Version Uploaded on Size
NoAho-0.9.tar.gz (md5) Source 2012-03-20 38KB