Wavelet Matrix/Tree succinct data structure for full text search (using shellinford C++ library)

Project description

shellinford

Shellinford is an implementation of a Wavelet Matrix/Tree succinct data structure for document retrieval.

It is based on the shellinford C++ library.

NOTE: This module requires a C++11 compiler.

Installation

$ pip install shellinford

Usage

Create a new FM-index instance

>>> import shellinford
>>> fm = shellinford.FMIndex()
  • shellinford.FMIndex([use_wavelet_tree=True, filename=None])

    • When given a filename, Shellinford loads FM-index data from the file

Build FM-index

>>> fm.build(['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky'], 'milky.fm')
  • build([docs, filename])

    • When given a filename, Shellinford stores FM-index data to the file

Search for a word in the FM-index

>>> for doc in fm.search('Milky'):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
doc_id: 2
count: [1]
text: Milky

>>> for doc in fm.search(['Milky', 'Holmes']):
...     print('doc_id:', doc.doc_id)
...     print('count:', doc.count)
...     print('text:', doc.text)
doc_id: 0
count: [1]
text: Milky Holmes
  • search(query, [_or=False, ignores=[]])

    • If _or = True, then “OR” search is executed, else “AND” search

    • Given ignores, “NOT” search is also executed

    • NOTE: The search function is available after FM-index is built or loaded
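
To make the AND / OR / NOT semantics above concrete, here is a minimal pure-Python sketch that reproduces them with a plain linear scan. It is not how shellinford works internally (no FM-index is involved), and `naive_search` is a hypothetical helper written only for this illustration; it assumes that `ignores` drops any document containing one of the ignored strings.

```python
def naive_search(docs, query, _or=False, ignores=()):
    """Linear-scan illustration of AND / OR / NOT matching (no index involved)."""
    # Accept a single string or a sequence of strings, as in the examples above.
    terms = [query] if isinstance(query, str) else list(query)
    combine = any if _or else all  # OR needs one term to match, AND needs all
    hits = []
    for doc_id, text in enumerate(docs):
        if not combine(term in text for term in terms):
            continue
        # "NOT" search (assumed semantics): skip documents containing an ignored term.
        if any(term in text for term in ignores):
            continue
        hits.append((doc_id, text))
    return hits


docs = ['Milky Holmes', 'Sherlock "Sheryl" Shellingford', 'Milky']
and_ids = [doc_id for doc_id, _ in naive_search(docs, ['Milky', 'Holmes'])]
or_ids = [doc_id for doc_id, _ in naive_search(docs, ['Milky', 'Sherlock'], _or=True)]
not_ids = [doc_id for doc_id, _ in naive_search(docs, 'Milky', ignores=['Holmes'])]
print(and_ids, or_ids, not_ids)
```

With the three sample documents, the AND query matches only document 0, the OR query matches all three, and the NOT query leaves only document 2.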

Count matches for a word in the FM-index

>>> fm.count('Milky')
2

>>> fm.count(['Milky', 'Holmes'])
1
  • count(query, [_or=False])

    • If _or = True, then “OR” search is executed, else “AND” search

    • NOTE: The count function is available after FM-index is built or loaded

    • This function is slightly faster than the search function
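
For intuition about what counting over an FM-index involves, the following is a minimal sketch of the textbook backward-search algorithm in pure Python. It is an illustration under simplifying assumptions, not shellinford's implementation: it counts pattern occurrences inside a single string rather than matching documents, it builds the Burrows-Wheeler transform by sorting all rotations, and a plain occurrence table stands in for the wavelet matrix/tree. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic; illustration only)."""
    s = text + "\0"  # sentinel, assumed absent from text and smaller than any character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_tables(last):
    """Return C (count of smaller characters) and occ (prefix counts over the BWT)."""
    C = {}
    for i, ch in enumerate(sorted(last)):
        C.setdefault(ch, i)  # first row whose suffix starts with ch
    # occ[c][i] is the number of occurrences of c in last[:i].
    occ = {c: [0] * (len(last) + 1) for c in C}
    for i, ch in enumerate(last):
        for c in C:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text using FM-index backward search."""
    last = bwt(text)
    C, occ = build_tables(last)
    lo, hi = 0, len(last)  # half-open range of sorted-rotation rows still matching
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


abra_count = count_occurrences("abracadabra", "abra")
a_count = count_occurrences("abracadabra", "a")
missing_count = count_occurrences("abracadabra", "xyz")
print(abra_count, a_count, missing_count)
```

Note that the loop only narrows a row range and never touches the original text, which is why a count can be produced without locating or materializing each match.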

Add a document

>>> fm.push_back('Baritsu')
  • push_back(doc)

    • NOTE: A document added by this method is not searchable until build() is called

Read FM-index from a binary file

>>> fm.read('milky_holmes.fm')
  • read(path)

Write FM-index binary to a file

>>> fm.write('milky_holmes.fm')
  • write(path)

Check whether the FM-index contains a string

>>> 'baritsu' in fm

License

  • Wrapper code is licensed under the New BSD License.

  • Bundled shellinford C++ library (c) 2012 echizen_tm is licensed under the New BSD License.

CHANGES

0.4.1 (2019-02-08)

  • Make “in” operator faster

0.4.0 (2018-09-30)

  • FMIndex.count() is added

  • No longer support Python 2.6

  • bug fix

0.3.5 (2018-09-05)

  • FMIndex.build() and FMIndex.push_back() ignore empty strings

  • FMIndex supports “in” operator. (e.g., ‘a’ in fm)

  • Support Python 3.5, 3.6 and 3.7

0.3.4 (2016-10-28)

  • FMIndex.search() returns list

0.3 (2014-11-24)

  • “OR” search and “NOT” search are available in FMIndex.search().

  • FMIndex.size and FMIndex.docsize are available as property

0.2 (2014-03-28)

  • “AND” search is available by passing a sequence (list, tuple, etc.) to FMIndex.search()

0.1 (2014-03-11)

  • First release.
