skip to navigation
skip to content

Not Logged In

NoAho 0.9

Fast, non-overlapping simultaneous multiple keyword search

Latest Version: 0.9.02

Non-Overlapping Aho-Corasick Trie

Features:
- 'short' and 'long' (longest key) searches, both one-off and iteration
  over all in some text (non-overlapping). (Short probably isn't useful
  without overlap.)
- Works with both unicode and str in Python 2, and unicode in Python 3
  (it's all UCS4 under the hood).
- Allows you to associate an arbitrary Python object payload with each
  keyword, and supports dict operations len, [], and 'in' for the
  keywords (though no del or traversal).
- Does the 'compilation' (generation of Aho-Corasick failure links) of
  the trie on-demand; you can mix adding keywords and searching text
  freely.
- Can be used commercially, it has a minimal license.

Anti-Features:
- Does not support overlapping keywords, unless you move along the
  string manually, one character at a time, which would defeat the
  purpose.
- find[all]_short is kind of useless.
- Lacks key iteration and deletion from the mapping (dict) protocol
- memory leaking untested (should be ok but ...)
- No /testcase/ for unicode in Python 2 (did manual test however)
- Unicode chars represented as ucs4, and, each character has its own
  hashtable, so it's relatively memory-heavy.
- Requires a C++ compiler.

Bug reports and patches welcome of course!



To build and install, use either
  # Python 2
  python setup.py install # (or build, and copy the .so to where you want it)
or
  # Python 3
  python3 setup.py install # (or build, and copy the .so to where you want it)


'text' below applies to str and unicode in Python 2, or unicode in Python 3 (all there is)
    t.add(key_text, optional payload)
    (start, end, key_payload) = t.find_short(text_to_search)
    (start, end, key_payload) = t.find_long(text_to_search)
    (start, end, key_payload) = t.findall_short(text_to_search)
    (start, end, key_payload) = t.findall_long(text_to_search)

Examples:
    >>> a = NoAho()
    >>> a.add('ms windows')
    >>> a.add('ms windows 2000', "this is canonical")
    >>> a.add('windows', None)
    >>> a.add('windows 2000')
    >>> text = 'windows 2000 ms windows 2000 windows'
    >>> for k in a.findall_short(text):
    ...     print text[k[0]:k[1]]
    ...
    windows
    ms windows
    windows
    >>> for k in a.findall_long(text):
    ...     print text[k[0]:k[1]]
    ...
    windows 2000
    ms windows 2000
    windows

Mapping (dictionary) methods:
    trie = NoAho()
    trie['apple'] = apple_handling_function
    trie['orange'] = Orange()
    trie.add('banana') # payload will be None
    trie['pear'] # will give key error
    assert isinstance(trie['orange'], Orange)
    assert 'banana' in trie
    len(trie)
    # No del;
    # no iteration over keys

The 'find[all]_short' forms are named as long and awkwardly as they are,
to leave plain 'find[all]' free if overlapping matches are ever implemented.


Deep Rebuilding:
Written in C++ and Cython.

You should not need to use Cython to build and use this module, but if
you want to change it and regenerate, there's a build command line at
the top of noaho.pyx.  To allow the generated noaho.cpp file to be
used in both Py2 and Py3, use a Python 3-built Cython. This was built
with Cython-0.15.1, but even as far back as 0.13 should work (for that
version though, you'd have to manually use a c++ compiler to do the
last step, of linking).


For the fullest spec of what the code will and will not so, check out
test-noaho.py (python[3] test-noaho.py)

Untested: whether the payload handling is complete, ie that there are no
memory leaks. It should be correct though.
 
File Type Py Version Uploaded on Size
NoAho-0.9.tar.gz (md5) Source 2012-03-20 38KB
  • Downloads (All Versions):
  • 15 downloads in the last day
  • 66 downloads in the last week
  • 339 downloads in the last month