delune

DeLune Python Object Storage and Search Engine

Introduction

DeLune (formerly Wissen) is a simple full-text search engine and Python object store (similar to the NoSQL document concept). The application logic is written in Python and the core index/search module in C.

I studied an earlier version of Lucene through Lupy and CLucene, and wrote my own search engine as an exercise.

Its file format, numeric compression algorithm, and indexing process are quite similar to those of early Lucene versions (I know nothing about recent versions). But the querying and result-fetching parts are built from my imagination. As a result it is entirely unorthodox and possibly inefficient (I am a typical nerd and work-alone programmer ;-)

DeLune is a kind of hybrid of a search engine and a NoSQL document database.

DeLune stores Python objects by serializing them with pickle and compressing the result, so if you use DeLune as a Python module, you can store and retrieve documents directly.
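As a rough illustration of pickle-plus-compression storage, here is a minimal sketch using only the standard library. The choice of zlib and the helper names pack and unpack are assumptions for illustration, not DeLune's actual on-disk format.

```python
import pickle
import zlib

def pack(obj):
    # Serialize with pickle, then compress (zlib is an assumed codec).
    return zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

def unpack(blob):
    # Reverse the two steps: decompress, then unpickle.
    return pickle.loads(zlib.decompress(blob))

document = ["violin sonata in c k.301", {"composer": "mozart", "birth": 1756}]
blob = pack(document)
restored = unpack(blob)
print(type(blob).__name__, restored == document)
```

Any picklable object survives the round trip unchanged, which is why the stored document does not need a schema.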

DeLune may be useful when a delay of a few minutes is acceptable before updates, inserts, and deletes take effect. For example, it suits legacy content, or content generated by you rather than by customers.

Like most full-text search engines, DeLune only ever appends data and never modifies existing files. So insert, update, and delete operations have a high disk-write cost. Sometimes one small deletion may trigger massive disk writes for optimization (even though the cost of the deletion itself is very low).

Anyway, if you need real-time changes to your data, DO NOT USE DeLune, or complement it with another type of NoSQL store or an RDBMS.

DeLune supports storing multiple documents for polymorphic use cases like listing and detail views. This is inefficient in storage usage, but helps read performance.

DeLune’s searching mechanism is similar to the DNA-RNA-Protein working model, which can be translated into ‘Index File - Temporary Small Replication Buffer - Query Result’.

1. Every searcher (Cell) has a single group of index file handlers (the DNA group in the nucleus).
2. Each thread has multiple small memory buffers (RNA) for replicating the needed parts of the index.
3. The Query class (Ribosome) creates a query result (Protein) by synthesizing the buffers’ information (RNAs). Each thread has its own memory space for creating the result set (Protein), which is not shared with other threads.
4. Repeat from step 2 if more results are expected.
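The model above can be pictured with a small pure-Python sketch: one shared, read-only index, and per-thread buffers that copy only the postings a query needs. The INDEX data, the search function, and the use of threading.local are illustrative assumptions, not DeLune internals (the real core is written in C).

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Shared, read-only "index file": term -> document IDs (made-up data).
INDEX = {"violin": [100, 104], "sonata": [100, 101], "piano": [101]}

_local = threading.local()

def search(terms):
    # Each thread lazily creates its own buffer; nothing here is shared.
    buffers = getattr(_local, "buffers", None)
    if buffers is None:
        buffers = _local.buffers = {}
    buffers.clear()
    for term in terms:
        # Replicate only the needed part of the index into the buffer.
        buffers[term] = list(INDEX.get(term, ()))
    postings = [set(ids) for ids in buffers.values()]
    # Build the result set in thread-private memory.
    return sorted(set.intersection(*postings)) if postings else []

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(search, [["violin", "sonata"], ["piano"]]))
print(results)
```

Because the index is never written during a search and every buffer is thread-private, no locking is needed on the read path in this sketch.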

It also provides a RESTful API for storing, indexing, and searching through Skitai App Engine:

Works in multi-process environments
Master-slave replication
Sharding, map-reducing, and load-balancing using Skitai features (not yet tested)

Installation

DeLune contains a C extension, so a C compiler is required.

pip install delune

On POSIX systems, some packages may be required:

apt-get install build-essential zlib1g-dev

Quick Start

All text field values should be of str type; otherwise an encoding should be specified.

Here’s an example that indexes only one document.

import delune

dln = delune.connect ("/home/delune")
col = dln.create ("mycol", ["mycol"], 1)

with col.documents as D:
  song = "violin sonata in c k.301"
  birth = 1756

  d = D.new (100) # document ID
  d.content ([song, {'composer': 'mozart', 'birth': birth}])
  d.field ("default", song, delune.TEXT)
  d.field ("birth", birth, delune.INT16)
  d.snippet (song)
  D.add (d)
  D.commit ()
  D.index ()
  result = D.query ("violin")

The result will look like this:

{
 'code': 200,
 'time': 0,
 'total': 1,
 'result': [
  [
   ['violin sonata in c k.301', {'composer': 'mozart', 'birth': 1756}], # content
   '<b>violin</b> sonata in c k.301', # auto snippet
   14, 0, 0, 0 # additional info
  ]
 ],
 'sorted': [None, 0],
 'regex': 'violin|violins',
}
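Each row of 'result' is a flat list: the stored content first, the snippet second, and the additional info after that. The following sketch unpacks such a dictionary. The helper name iter_hits is hypothetical, and treating code 200 as success is an assumption based on the example above.

```python
result = {
    "code": 200,
    "time": 0,
    "total": 1,
    "result": [
        [
            ["violin sonata in c k.301", {"composer": "mozart", "birth": 1756}],
            "<b>violin</b> sonata in c k.301",
            14, 0, 0, 0,
        ]
    ],
    "sorted": [None, 0],
    "regex": "violin|violins",
}

def iter_hits(result):
    """Yield (content, snippet, extra) for each row of a result dict."""
    if result["code"] != 200:
        raise RuntimeError("search failed with code %s" % result["code"])
    for row in result["result"]:
        # First item is the content, second the snippet, the rest is extra info.
        content, snippet, *extra = row
        yield content, snippet, extra

hits = list(iter_hits(result))
for content, snippet, extra in hits:
    title, meta = content
    print(title, meta["composer"], snippet)
```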

A DeLune document can be any picklable Python object; DeLune stores documents in a compressed pickle format. If you want to fetch part of a document by key or index, however, the document skeleton should be a list or dictionary, though the inner data can still be any picklable objects. If your data needs many more reads than writes/updates, DeLune can serve as both a simple schemaless data store and a full-text search engine. DeLune’s RESTful API and replication are described later in this document.

Configure DeLune

Indexing does not require configuration, but searching does. The reason is that DeLune allocates memory per thread for searching and classifying at initialization.

delune.configure (
  numthread,
  logger,
  io_buf_size = 4096,
  mem_limit = 256
)

numthread: number of threads that access DeLune collections and models. If set to 8, you can open multiple collections (or models) and access them with 8 threads. If a 9th thread tries to access DeLune, an error is raised
logger: see the Logger section under Low Level API
io_buf_size = 4096: size in bytes of the flash buffer for replicating index files
mem_limit = 256: memory limit per thread, but it is not absolute. It can be exceeded during calculation if needed, but the memory is returned as soon as the calculation has finished.

Finally, when your app terminates, call shutdown.

delune.shutdown ()

Indexing and Searching On Local Machine

Although the quick start calls the index method directly to index documents, DeLune also provides the indexer as a backend service.

Run Indexer as Service

# one-time indexing in console
delune index -v /home/delune

# indexing every 5 minutes in console
delune index -v /home/delune -i 300

# indexing every 5 minutes as daemon
delune index -dv /home/delune -i 300

# restart indexing daemon, indexing every 5 minutes
delune index -v /home/delune -i 300 restart

# stop indexing daemon
delune index stop

# status of indexing daemon
delune index status

Using Client API

Connecting to Delune Resources

import delune

dln = delune.connect ("/home/delune")

As a result, DeLune checks for these directories and creates them if needed:

/home/delune/delune/config
/home/delune/delune/collections

Creating New Collection

col = dln.create ("mycol", ["mycol"], 1)
col.save ()

As a result, the collection is created like this:

/home/delune/delune/config/mycol : JSON file containing configuration options
/home/delune/delune/collections/mycol

You can use multiple disks to increase the speed or capacity of a collection.

First of all, mount your disks under /home/delune/delune/collections:

/home/delune/delune/collections/hdd0
/home/delune/delune/collections/hdd1

Then create collection.

col = dln.create ("mycol", ["hdd0/mycol", "hdd1/mycol"], 1)
col.save ()

As a result, the collection will be created like this:

/home/delune/delune/collections/hdd0/mycol
/home/delune/delune/collections/hdd1/mycol

The collection's segment files will be created in these directories at random (taking the free space of each disk into account).
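One way to picture "random, but considering free space" is a weighted random choice. The sketch below is only an assumption about how such a policy could work; pick_data_dir and the weighting by free bytes are hypothetical, not DeLune's actual placement code.

```python
import random
import shutil
from collections import namedtuple

def pick_data_dir(data_dirs, free_space=shutil.disk_usage, rng=random):
    """Pick one directory at random, weighted by its free bytes."""
    weights = [free_space(path).free for path in data_dirs]
    if not any(weights):
        raise OSError("no free space in any data directory")
    return rng.choices(data_dirs, weights=weights, k=1)[0]

# Stand-in for shutil.disk_usage so the sketch runs without real mounts.
Usage = namedtuple("Usage", "total used free")
fake = {"hdd0/mycol": Usage(100, 100, 0), "hdd1/mycol": Usage(100, 20, 80)}
chosen = pick_data_dir(list(fake), free_space=fake.__getitem__)
print(chosen)
```

With real disks you would call pick_data_dir with absolute paths and leave free_space at its default, shutil.disk_usage. A full disk gets weight 0 and is never chosen.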

Configuring Collection

There are 2 ways to configure your collections.

First, use the col.config dictionary.

col = dln.create ("mycol", ["mycol"], version = 1)
col.config

>> {
     'name': 'mycol',
     'data_dir': ["mycol"],
     "version": 1,

     'analyzer': {
       "max_terms": 3000,
       "stem_level": 1,
       "strip_html": 0,
       "make_lower_case": 1,
       "ngram": 1,
       "biword": 0,
       "stopwords_case_sensitive": 1,
       "ngram_no_space": 0,
       "contains_alpha_only": 0,
       "stopwords": [],
       "endwords": [],
     },
     'indexer': {
       'optimize': 1,
       'force_merge': 0,
       'max_memory': 10000000,
       'max_segments': 10,
       'lazy_merge': (0.3, 0.5),
     },
     'searcher': {
       'max_result': 2000,
       'num_query_cache': 1000
     }
   }

Just change the values as you want.

Another way is to set options when creating the collection.

col = dln.create (
  "mycol",
  ["mycol"],
  version = 1,
  max_terms = 5000,
  strip_html = 1,
  force_merge = 1,
  max_result = 10000
)
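The flat keyword options correspond to keys inside the nested analyzer, indexer, and searcher sections of col.config. A minimal sketch of that routing follows. The function apply_options and the lookup-by-key-name rule are assumptions for illustration, and DEFAULT_CONFIG is an abridged copy of the dictionary shown earlier.

```python
import copy

DEFAULT_CONFIG = {
    "analyzer": {"max_terms": 3000, "stem_level": 1, "strip_html": 0},
    "indexer": {"optimize": 1, "force_merge": 0, "max_segments": 10},
    "searcher": {"max_result": 2000, "num_query_cache": 1000},
}

def apply_options(config, **options):
    """Return a copy of config with each flat option routed to its section."""
    updated = copy.deepcopy(config)
    for key, value in options.items():
        sections = [s for s in updated.values() if isinstance(s, dict) and key in s]
        if not sections:
            raise KeyError("unknown option: %s" % key)
        sections[0][key] = value
    return updated

config = apply_options(
    DEFAULT_CONFIG, max_terms=5000, strip_html=1, force_merge=1, max_result=10000
)
print(config["analyzer"], config["indexer"], config["searcher"])
```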

For more detail on analyzer, indexer, and searcher options, see the Low Level API section.

Adding Documents To Collection

with col.documents as D:
  for code, title in my_codes:
    d = D.new (code) # code is used as document ID
    d.content ([code, title])
    d.field ("code", code, delune.STRING)
    d.field ("default", title, delune.TEXT)
    D.add (d)
  D.commit ()

It is important to understand that the above operation doesn’t actually make any change to your collection. It just saves your documents at:

/home/delune/delune/collections/mycol/.que/

If you commit multiple times, a new que file is created for each commit.

Adding Documents Without ID

d = D.new ()

Note that in this case you cannot update/modify your documents.

Deleting Documents From Collection

If your documents have IDs,

with col.documents as D:
  for code, title in my_codes:
    D.delete (code)
  D.commit ()

Else,

with col.documents as D:
  D.qdelete ("milk")
  D.commit ()

All documents containing ‘milk’ will be deleted.

Indexing

If you run the delune indexer, these saved documents will be indexed automatically. Or you can index manually,

delune index -v /home/delune

Searching

dln = delune.connect ("/home/delune")
col = dln.load ("mycol")
with col.documents as D:
  D.search ("violin")

search() spec is:

D.search (
  q,
  offset = 0,
  limit = 10,
  sort = "", # INT field name
  snippet = 30, # number of terms for snippet
  partial = "", # specify index or key of a content
  nthdoc = 0, # specify index of contents
  lang = "un",
  analyze = 1, # set to 0 if query terms are already analyzed
  data = 1, # whether or not return content part
  qlimit = 1 # whether or not apply limitation for number of searched documents by max_result
)
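The offset and limit parameters allow paging through a result set. The sketch below walks all pages with a generic helper. It assumes the search call returns a dictionary with 'total' and 'result' keys like the quick start example. iter_all and fake_search are hypothetical names, and fake_search only stands in for D.search so the example runs without an index.

```python
def iter_all(search, q, page_size=10):
    """Yield every row for q by advancing offset until 'total' is reached."""
    offset = 0
    while True:
        page = search(q, offset=offset, limit=page_size)
        rows = page["result"]
        yield from rows
        offset += len(rows)
        if not rows or offset >= page["total"]:
            break

# Stand-in for D.search: 23 fake rows, sliced by offset and limit.
def fake_search(q, offset=0, limit=10):
    rows = ["%s-%d" % (q, i) for i in range(23)]
    return {"code": 200, "total": len(rows), "result": rows[offset:offset + limit]}

rows = list(iter_all(fake_search, "violin"))
print(len(rows), rows[0], rows[-1])
```

With a real collection you would pass D.search in place of fake_search. Note that max_result may cap how far paging can go unless qlimit is set to 0.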

Truncating Documents

col.documents.truncate ("mycol")
col.documents.commit ()

Drop Collection

col.drop (include_data = True)

Indexing and Searching On Remote Machine

You can also use a remote DeLune resource.

Running RESTful API

New in version 0.12.14

You can use the RESTful API with Skitai App Engine on your remote machine.

First of all, you need to install skitai:

pip3 install -U skitai

Then copy the code below and save it as app.py.

import os
import delune
import skitai

if __name__ == "__main__":
  pref = skitai.pref ()
  pref.use_reloader = 1
  pref.debug = 1

  config = pref.config
  config.resource_dir = "/home/delune"

  skitai.trackers ('delune:collection')
  skitai.mount ("/", delune, "app", pref)
  skitai.run (
    workers = 2,
    threads = 4,
    port = 5000
  )

Then run it:

python3 app.py

You can then access http://<your IP address>:5000/v1

For more detail about API, see app.py.

Run Indexer as Service

As with local use, you should run the indexer:

delune index -dv /home/delune -i 300

This will index committed documents every 5 minutes.

Using Client API

It is exactly the same as the local API except for the connect parameter. The parameter should start with “http://” or “https://” and end with a version string like “v1”.

dln = delune.connect ("http://192.168.0.200:5000/v1")
col = dln.create ("mycol", ["mycol"], 1)
col.save ()
...
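To check a connect parameter before using it, a small validator along these lines can help. The function name is_remote_resource and the exact "v followed by digits" pattern are assumptions; the text above only says the string ends with a version string like “v1”.

```python
import re
from urllib.parse import urlsplit

def is_remote_resource(target):
    """True if target looks like http(s)://host[:port]/v<N>."""
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    # Assumed version format: 'v' followed by digits at the end of the path.
    return re.search(r"/v\d+/?$", parts.path) is not None

for target in ("http://192.168.0.200:5000/v1", "/home/delune"):
    print(target, is_remote_resource(target))
```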

Note that you no longer need to run the indexer in the background on your local machine.

Replicating Delune Resources

You can run a replica server for distributed search or backup.

Replicating Your Collection

# replicate every 5 minutes from http://192.168.0.200/v1
delune replicate -o http://192.168.0.200/v1 -i 300

As a result, all remote DeLune resources will be replicated with exactly the same directory structure.

Limitation

Before you test DeLune, you should know about some limitations.

DeLune search cannot sort by string-type fields, but it can sort by int/bit/coord types and by TFIDF ranking.

Low Level API

Logger

from delune.lib import logger

logger.screen_logger ()

# creates the file '/var/log/delune.log', rotated daily
logger.rotate_logger ("/var/log", "delune", "daily")

Standard Analyzer

An analyzer is needed by the TEXT and TERM types.

Basic usage is:

analyzer = delune.standard_analyzer (
  max_term = 8,
  numthread = 1,
  ngram = True, # True or False
  stem_level = 1, # 0, 1 or 2 (2 is only applied to English)
  make_lower_case = True, # True or False
  stopwords_case_sensitive = True, # True or False
  ngram_no_space = False, # True or False
  strip_html = False, # True or False
  contains_alpha_only = False, # True or False
  stopwords = [] # list of custom stopwords
)

stem_level: 0 or 1; the ‘en’ language additionally has level 2 for hard stemming
make_lower_case: convert all text to lower case
stopwords_case_sensitive: takes effect only if make_lower_case is False
ngram_no_space: if False, ‘泣斬 馬謖’ will be tokenized to _泣, 泣斬, 斬_, _馬, 馬謖, 謖_. But if True, an additional bi-gram 斬馬 will be created between 斬_ and _馬.
strip_html
contains_alpha_only: remove terms that contain no alphabetic characters; this option is useful for full-text training in some cases
stopwords: DeLune ships only an English stopword list. You can supply a custom stopword list instead. Stopwords should be unicode or UTF-8 encoded bytes
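The ngram_no_space behaviour can be reproduced with a few lines of plain Python. This is a sketch of the tokenization described above, not the C analyzer itself. It assumes the input has a space between 泣斬 and 馬謖 (the boundary tokens 斬_ and _馬 suggest one), and bigram_tokens is a hypothetical name.

```python
def bigram_tokens(text, ngram_no_space=False):
    """Bi-gram tokens with '_' marking the start and end of each word."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        if i > 0 and ngram_no_space:
            # Extra bi-gram bridging the previous word and this one.
            tokens.append(words[i - 1][-1] + word[0])
        tokens.append("_" + word[0])
        tokens.extend(word[j:j + 2] for j in range(len(word) - 1))
        tokens.append(word[-1] + "_")
    return tokens

print(bigram_tokens("泣斬 馬謖"))
print(bigram_tokens("泣斬 馬謖", ngram_no_space=True))
```

With ngram_no_space enabled, a query typed without the space can still match, at the price of one more token per word boundary.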

DeLune has several stemmers and n-gram methods for international languages, which can be used this way:

analyzer = delune.standard_analyzer (ngram = True, stem_level = 1)
col = delune.collection ("./col", delune.CREATE, analyzer)
indexer = col.get_indexer ()
song = "violin sonata in c k.301"
document = delune.document ()
document.field ("default", song, delune.TEXT, lang = "en")

Implemented Stemmers

Except for the English stemmer, all stemmers can be obtained from IR Multilingual Resources at UniNE.

ar: Arabic

de: German

en: English

es: Spanish

fi: Finnish

fr: French

hu: Hungarian

it: Italian

pt: Portuguese

sv: Swedish

Bi-Gram Index

If ngram is set to True, these languages will be indexed with bi-gram.

cn: Chinese

ja: Japanese

ko: Korean

Also note that if a word contains only alphabetic characters, the English stemmer is used.

Tri-Gram Index

For the other languages, the English stemmer is used if every character of a word is alphabetic. And if ngram is set to True, words containing multibyte characters are indexed with tri-grams.

Methods Spec

analyzer.index (document, lang)

analyzer.freq (document, lang)

analyzer.stem (document, lang)

analyzer.count_stopwords (document, lang)

Collection

Collection manages index files, segments and properties.

col = delune.collection (
  indexdir = [dirs],
  mode = [ CREATE | READ | APPEND ],
  analyzer = None,
  logger = None
)

indexdir: a path or a list of paths, for using multiple disks efficiently
mode
analyzer
logger: not necessary if a logger was configured by delune.configure

Collection has 2 major classes: indexer and searcher.

Indexer

To search documents, the text must first be indexed to build an inverted index for fast term queries.

indexer = col.get_indexer (
  max_segments = 10, # int
  force_merge = False, # True or False
  max_memory = 10000000, # 10 MB
  optimize = True # True or False
)

max_segments: maximum number of index segments; if exceeded, segments are merged. Also note that during indexing up to 3 times max_segments segments may be created, and when indexer.close () is called, merging is attempted automatically until the number of segments is appropriate
force_merge: when indexer.close () is called, forcibly try to merge into a single segment. This fails if the index is too big: over 2 GB on a 32-bit OS, over 10 GB on a 64-bit OS
max_memory: if exceeded during indexing, a new segment is created
optimize: when indexer.close () is called, segments are merged down to as close to the optimal number as possible

To add a document to the indexer, create a document object:

document = delune.document ()

DeLune handles 3 objects as completely different objects, with no relationship between them:

returning content
snippet-generating field
searchable fields

Set Returning Content

DeLune serializes returning contents with pickle, so you can set any picklable object.

document.content ({"userid": "hansroh", "preference": {"notification": "email", ...}})

or

document.content ([32768, "This is sample ..."])

To save multiple contents,

document.content ({"userid": "hansroh", "preference": {"notification": "email", ...}})
document.content ([32768, "This is sample ..."])

You can select one of these at query time using the nthdoc parameter (0 or 1).

Snippet Generating Field

This field should be unicode or UTF-8 encoded bytes.

document.snippet ("This is sample...")

Searchable Fields

A document also receives searchable fields:

document.field (name, value, ftype = delune.TEXT, lang = "un", encoding = None)

document.field ("default", "violin sonata in c k.301", delune.TEXT, "en")
document.field ("composer", "wolfgang amadeus mozart", delune.TEXT, "en")
document.field ("lastname", "mozart", delune.STRING)
document.field ("birth", 1756, delune.INT16)
document.field ("genre", "01011111", delune.BIT8)
document.field ("home", "50.665629/8.048906", delune.COORD6)

name: if ‘default’, this field is searched by a plain query string; otherwise use ‘name:query_text’
value: unicode or UTF-8 encoded text; otherwise the encoding arg should be given
ftype: see below
encoding: give a value like ‘iso8859-1’ if value is not unicode/UTF-8
lang: language code for standard_analyzer, “un” (unknown) is default

Available field types are:

TEXT: analyzable full-text, result-not-sortable

TERM: analyzable full-text, but position data is not indexed, so phrases cannot be searched, result-not-sortable

STRING: exact string match, like nation codes, result-not-sortable

LIST: comma separated STRING, result-not-sortable

FNUM: formatted number; the value should be int or float and a format parameter is required. The format is “digit.digit”: the number of digits of the integer part (zero-padded), then the number of digits of the fractional part. This makes efficient range search possible.

COORDn, n=4,6,8 decimal precision: comma separated string ‘latitude,longitude’; latitude and longitude should be floats in the ranges -90 ~ 90 and -180 ~ 180. n is the precision of the coordinates. n=4 is 10m radius precision, 6 is 1m and 8 is 10cm. result-sortable

BITn, n=8,16,24,32,40,48,56,64: bitwise operation, a bit-marked string of length n is required, result-sortable

INTn, n=8,16,24,32,40,48,56,64: range, int required, result-sortable

Note1: Make sure COORD, INT and BIT fields are present in every document even if they have no value, because these types depend on the document's indexed sequence ID. If a field has no value, set it to None; do NOT omit the field.

Note2: FNUM 100.12345 with format=”5.3” is internally converted into “00100.123”, and a negative value will be -00100.123. MAKE SURE your values are within -99999.999 and 99999.999.
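The FNUM conversion can be sketched in a few lines. The helper format_fnum is hypothetical, and rounding the fraction (rather than truncating it) is an assumption; the example value 100.12345 gives the same string either way.

```python
def format_fnum(value, fmt="5.3"):
    """Zero-pad value to the 'digit.digit' layout described above."""
    int_digits, frac_digits = (int(part) for part in fmt.split("."))
    width = int_digits + 1 + frac_digits  # integer part + '.' + fraction
    text = "%0*.*f" % (width, frac_digits, abs(value))
    if len(text) > width:
        raise ValueError("%r does not fit format %r" % (value, fmt))
    return ("-" if value < 0 else "") + text

print(format_fnum(100.12345))   # 00100.123
print(format_fnum(-100.12345))  # -00100.123

# For non-negative values, string order equals numeric order.
keys = [format_fnum(v) for v in (1234.0, 99.5, 100.12345)]
print(sorted(keys))
```

Fixed-width, zero-padded strings are what make range search cheap: comparing two keys as strings gives the same answer as comparing the numbers, at least for non-negative values.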

Repeat add_document as needed, then close the indexer.

for ...:
  document = delune.document ()
  ...
  indexer.add_document (document)
indexer.close ()

If searchers using this collection run in another process or thread, they are automatically reloaded within a few seconds to apply the changed index.

Searcher

To run a searcher, you should call delune.configure () first and then create the searcher.

searcher = col.get_searcher (
  max_result = 2000,
  num_query_cache = 200
)

max_result: maximum number of search results returned. Default is 2000; if set to 0, results are unlimited
num_query_cache: default is 200; when exceeded, the least recently accessed entries are removed
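Evicting by oldest access time is the classic LRU (least recently used) policy. Here is a minimal sketch of such a cache using the standard library. The QueryCache class is an illustration of the policy only, not DeLune's cache implementation.

```python
from collections import OrderedDict

class QueryCache:
    """Keep at most `capacity` results, evicting the least recently used."""

    def __init__(self, capacity=200):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently accessed
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

cache = QueryCache(capacity=2)
cache.put("violin", 1)
cache.put("piano", 2)
cache.get("violin")      # 'violin' is now fresher than 'piano'
cache.put("sonata", 3)   # evicts 'piano'
print(cache.get("piano"), cache.get("violin"), cache.get("sonata"))
```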

Query is simple:

searcher.query (
  qs,
  offset = 0,
  fetch = 10,
  sort = "tfidf",
  summary = 30,
  lang = "un"
)

qs: string (unicode) or UTF-8 encoded bytes. For detailed query syntax, see below
offset: start position of the returned result records
fetch: number of records to return from offset
sort: “(+-)tfidf” or “(+-)field name”; the field name should be of int/bit type. ‘-’ means descending (high score/value first) and is the default if not specified. If sort is “”, records are returned in reverse indexing order
summary: number of terms for snippet
lang: default is “un” (unknown)
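The sort string convention can be made concrete with a small parser. parse_sort is a hypothetical helper, and reading ‘+’ as ascending is an inference from the description of ‘-’ above.

```python
def parse_sort(sort):
    """Return (key, descending), or None for an empty sort string."""
    if not sort:
        return None  # no sort key: reverse indexing order
    descending = True  # '-' is the default when no sign is given
    if sort[0] in "+-":
        descending = sort[0] == "-"
        sort = sort[1:]
    return sort, descending

for spec in ("tfidf", "-tfidf", "+birth", ""):
    print(repr(spec), parse_sort(spec))
```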

To delete indexed documents:

searcher.delete (qs)

All matching documents will be deleted immediately. And if searchers using this collection run in another process or thread, these searchers are automatically reloaded within a few seconds.

Finally, close searcher.

searcher.close ()

Query Syntax

violin composer:mozart birth:1700~1800

search ‘violin’ in the default field, ‘mozart’ in the composer field, and the range between 1700 and 1800 in the birth field

violin allcomposer:wolfgang mozart

search ‘violin’ in the default field; all terms after allcomposer are searched in the composer field

violin -sonata birth2:1700~1800

birth2 is between ‘1700’ and ‘1800’

violin -sonata birth:~1800

does not contain ‘sonata’ in the default field

violin -composer:mozart

does not contain ‘mozart’ in the composer field
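To make the basic token forms concrete, here is a toy parser for the simplest cases above: plain terms, field:value, ‘-’ exclusion, and ‘~’ ranges. It is a sketch only. parse_terms is a hypothetical name, and it deliberately ignores quotes, parentheses, ‘or’, ‘all’-prefixed fields, and the other syntax shown in this section.

```python
def parse_terms(query):
    """Split a simple query into (field, value, excluded) tuples."""
    parsed = []
    for token in query.split():
        excluded = token.startswith("-")
        if excluded:
            token = token[1:]
        field, sep, value = token.partition(":")
        if not sep:
            # No 'name:' prefix: the term goes to the default field.
            field, value = "default", token
        if "~" in value:
            low, _, high = value.partition("~")
            value = (low or None, high or None)  # open-ended side -> None
        parsed.append((field, value, excluded))
    return parsed

print(parse_terms("violin composer:mozart birth:1700~1800"))
print(parse_terms("violin -sonata birth:~1800"))
```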

violin or piano genre:00001101/all

matches documents whose 5th, 6th and 8th bits are all 1. /any and /none are also available
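The three modes can be expressed over bit strings as follows. This is a sketch of the matching rule as described, with positions counted from the left of the string. bit_match is a hypothetical helper, not the engine's bitwise implementation.

```python
def bit_match(value, mask, mode="all"):
    """Compare the bits of value at the positions where mask has a '1'."""
    if len(value) != len(mask):
        raise ValueError("value and mask must have the same length")
    hits = [v == "1" for v, m in zip(value, mask) if m == "1"]
    if mode == "all":
        return all(hits)
    if mode == "any":
        return any(hits)
    if mode == "none":
        return not any(hits)
    raise ValueError("mode must be 'all', 'any' or 'none'")

genre = "01011111"  # the BIT8 value from the field example
for mode in ("all", "any", "none"):
    print(mode, bit_match(genre, "00001101", mode))
```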

violin or ((piano composer:mozart) genre:00001101/any)

supports unlimited nesting of ‘()’ groups and ‘or’ operators

(violin or ((allcomposer:mozart wolfgang) -amadeus)) sonata (genre:00001101/none home:50.6656,8.0489~10000)

search for home locations within 10 km of the coordinate (50.6656, 8.0489)
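A radius search like this comes down to a great-circle distance test. The sketch below uses the haversine formula with a mean Earth radius. Whether DeLune computes distance this way is not stated here, and reading the value after ‘~’ as metres is inferred from "10000" corresponding to 10 km.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within(center, point, radius_m):
    return distance_m(*center, *point) <= radius_m

home = (50.665629, 8.048906)  # the COORD6 value from the field example
print(within((50.6656, 8.0489), home, 10000))
```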

"violin sonata" genre:00001101/none home:50.6656/8.0489~10

search exact phrase "violin sonata"

"violin^3 piano" -composer:"ludwig van beethoven"

search loose phrase "violin piano" within 3 terms

Migrating From Version 0.3x

Upgrade libraries

pip3 install -U skitai quests delune

Then restructure the directories

DELUNE_ROOT="/home/delune"
mkdir "$DELUNE_ROOT/delune"
mv "$DELUNE_ROOT/models/.config" "$DELUNE_ROOT/delune/config"
mv "$DELUNE_ROOT/models" "$DELUNE_ROOT/delune/collections"

Edit all your config files and remove models/ from the data_dir option.

"data_dir": ["models/mycols"]
=>  "data_dir": ["mycols"]
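If you have many collections, the data_dir edit can be scripted. Since the config files are JSON, a sketch like the following does the rewrite on a loaded dictionary. strip_models_prefix is a hypothetical helper, and reading and writing the actual files under delune/config is left out.

```python
import json

def strip_models_prefix(config, prefix="models/"):
    """Return a copy of config whose data_dir entries drop the old prefix."""
    migrated = dict(config)
    migrated["data_dir"] = [
        path[len(prefix):] if path.startswith(prefix) else path
        for path in config["data_dir"]
    ]
    return migrated

old = json.loads('{"name": "mycols", "data_dir": ["models/mycols"], "version": 1}')
new = strip_models_prefix(old)
print(json.dumps(new))
```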

If you use the RESTful API service, remove index- or mirror-related code lines from your app launch script.

Finally, run indexer.

delune index -dv /home/delune -i 300

Change Log

0.4 (June 2, 2018)

officially ceased developing naivebayes classifier & learner

integrated local and remote indexing and searching APIs

directory structure is NOT compatible with version 0.3x

0.3 (Sep 15, 2017)

fix wildcard & range search

fix snippet thing

add stem API

add index field aliasing to document

add string range searching, add new field type: ZFn

add multiple documents storing feature. as a result, DeLune is read-only for Wissen collections

0.2 (Sep 14, 2017)

fix minor bugs

0.1 (Sep 13, 2017)

change package name from Wissen to DeLune

Earlier Wissen Period

0.13

fix using lock

add truncate collection API

fix updating document

change replicating way to use sticky session connection with origin server

fix file creation mode on posix

fix using lock with multiple workers

change wissen.document method names

fix index queue file locking

0.12

add biword arg to standard_analyzer

change export package name from appack to package

add Skito-Saddle app

fix analyzer.count_stopwords return value

change development status to Alpha

add wissen.assign(alias, searcher/classifier) and query(alias), guess(alias)

fix threads count and memory allocation

add example for Skitai App Engine app to manual

0.11

fix HTML strip and segment merging etc.

add MULTIPATH classifier

add learner.optimize ()

make learner.build & learner.train efficient

0.10 - change version format, remove all str*_s ()

0.9 - support Python 3.x

0.8 - change license from BSD to GPL V3
