delune

DeLune Full-Text Search & Classification Engine

Introduction

DeLune (formerly Wissen) is a simple search engine, written mostly in Python and C in 2008.

At that time I wanted to study early versions of Lucene through Lupy and CLucene, and I also built my own search engine as an exercise.

Its file format, numeric compression algorithm and indexing process are quite similar to Lucene’s. The querying and result-fetching parts, however, were built from my imagination. As a result they are entirely unorthodox and possibly inefficient. DeLune’s searching mechanism resembles the DNA-RNA-Protein model, translated into ‘Index File - Temporary Small Replication Buffer - Query Result’.

1. Every searcher (cell) has a single group of index file handlers (the DNA group in the nucleus)
2. Each thread has multiple small buffers (RNA) for replicating the needed parts of the index
3. The query class (ribosome) creates the query result (protein) by synthesising the buffers’ information (RNAs)
4. Repeat from step 2 if more results are expected

Installation

DeLune contains a C extension, so a C compiler is required.

pip install delune

On POSIX systems, some additional packages might be required:

apt-get install gcc zlib1g-dev

Quick Start

All text field values should be of str type; otherwise an encoding should be specified.

Indexing and Searching

Here’s an example that indexes a single document.

import delune

# indexing
analyzer = delune.standard_analyzer (max_term = 3000)
col = delune.collection ("./col", delune.CREATE, analyzer)
indexer = col.get_indexer ()

song = "violin sonata in c k.301"
composer = u"wolfgang amadeus mozart"
birth = 1756
home = "50.665629/8.048906" # Latitude / Longitude of Salzburg
genre = "01011111" # (rock serenade jazz piano symphony opera quartet sonata)

document = delune.document ()

# object to return, any object serializable by pickle
document.content ([song, composer])

# text content used to generate an auto snippet for the given query terms
document.snippet (song)

# add searchable fields
document.field ("default", song, delune.TEXT)
document.field ("composer", composer, delune.TEXT)
document.field ("birth", birth, delune.INT16)
document.field ("genre", genre, delune.BIT8)
document.field ("home", home, delune.COORD)

indexer.add_document (document)
indexer.close ()

# searching
analyzer = delune.standard_analyzer (max_term = 8)
col = delune.collection ("./col", delune.READ, analyzer)
searcher = col.get_searcher ()
print (searcher.query (u'violin', offset = 0, fetch = 2, sort = "tfidf", summary = 30))
searcher.close ()

The result will look like this:

{
 'code': 200,
 'time': 0,
 'total': 1,
 'result': [
  [
   ['violin sonata in c k.301', 'wolfgang amadeus mozart'], # content
   '<b>violin</b> sonata in c k.301', # auto snippet
   14, 0, 0, 0 # additional info
  ]
 ],
 'sorted': [None, 0],
 'regex': 'violin|violins',
}
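
A response in this shape can be unpacked with plain Python. The sketch below is not part of DeLune: it only assumes that each row is laid out as in the example (content first, snippet second, the remaining values as additional info). The helper name `iter_hits` is hypothetical, and the sample response is modeled on the example result.

```python
def iter_hits(response):
    """Yield (content, snippet, extra) for each row of a query response."""
    if response.get("code") != 200:
        raise RuntimeError("query failed with code %r" % response.get("code"))
    for row in response.get("result", []):
        # assumed row layout: [content, snippet, *additional info]
        content, snippet = row[0], row[1]
        yield content, snippet, tuple(row[2:])


# sample response modeled on the example result
response = {
    "code": 200,
    "time": 0,
    "total": 1,
    "result": [
        [
            ["violin sonata in c k.301", "wolfgang amadeus mozart"],
            "<b>violin</b> sonata in c k.301",
            14, 0, 0, 0,
        ]
    ],
    "sorted": [None, 0],
    "regex": "violin|violins",
}

hits = list(iter_hits(response))
for content, snippet, extra in hits:
    print(content[0], "|", snippet, "|", extra)
```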

A DeLune document can be any picklable Python object; DeLune stores documents in a zipped, pickled format. But if you want to fetch partial documents by key or index, the document skeleton should be a list or dictionary, while the inner data can still be any picklable objects. I think that if your data needs far more reads than writes or updates, DeLune can serve as both a simple schemaless data store and a full-text search engine. DeLune’s RESTful API and replication are described at the end of this document.

Learning and Classification

Here’s an example that guesses either ‘Play Golf’ or ‘Go To Bed’ from weather conditions.

import delune

analyzer = delune.standard_analyzer (max_term = 3000)

# learning

mdl = delune.model ("./mdl", delune.CREATE, analyzer)
learner = mdl.get_learner ()

document = delune.labeled_document ("Play Golf", "cloudy windy warm")
learner.add_document (document)
document = delune.labeled_document ("Play Golf", "windy sunny warm")
learner.add_document (document)
document = delune.labeled_document ("Go To Bed", "cold rainy")
learner.add_document (document)
document = delune.labeled_document ("Go To Bed", "windy rainy warm")
learner.add_document (document)
learner.close ()

mdl = delune.model ("./mdl", delune.MODIFY, analyzer)
learner = mdl.get_learner ()
learner.listbydf () # show all terms with DF (Document Frequency)
learner.close ()

mdl = delune.model ("./mdl", delune.MODIFY, analyzer)
learner = mdl.get_learner ()
learner.build (dfmin = 2) # build corpus DF >= 2
learner.close ()

mdl = delune.model ("./mdl", delune.MODIFY, analyzer)
learner = mdl.get_learner ()
learner.train (
  cl_for = delune.ALL, # for which classifier
  selector = delune.CHI2, # feature selecting method
  select = 0.99, # how many features?
  orderby = delune.MAX, # feature ranking by what?
  dfmin = 2 # exclude DF < 2
)
learner.close ()


# guessing

mdl = delune.model ("./mdl", delune.READ, analyzer)
classifier = mdl.get_classifier ()
print (classifier.guess ("rainy cold", cl = delune.NAIVEBAYES))
print (classifier.guess ("rainy cold", cl = delune.FEATUREVOTE))
print (classifier.guess ("rainy cold", cl = delune.TFIDF))
print (classifier.guess ("rainy cold", cl = delune.SIMILARITY))
print (classifier.guess ("rainy cold", cl = delune.ROCCHIO))
print (classifier.guess ("rainy cold", cl = delune.MULTIPATH))
print (classifier.guess ("rainy cold", cl = delune.META))
classifier.close ()

The result will look like this:

{
  'code': 200,
  'total': 1,
  'time': 5,
  'result': [('Go To Bed', 1.0)],
  'classifier': 'meta'
}
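
Picking the winning label out of such a response takes only a few lines. This is a minimal sketch that assumes 'result' is a list of (label, score) pairs as in the example; `best_label` and its `min_score` parameter are hypothetical names, not DeLune functions.

```python
def best_label(response, min_score=0.0):
    """Return the highest scored label, or None when nothing qualifies."""
    if response.get("code") != 200 or not response.get("result"):
        return None
    label, score = max(response["result"], key=lambda pair: pair[1])
    return label if score >= min_score else None


# sample response modeled on the example result
response = {
    "code": 200,
    "total": 1,
    "time": 5,
    "result": [("Go To Bed", 1.0)],
    "classifier": "meta",
}

print(best_label(response))
```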

Limitations

Before you test DeLune, you should know some of its limitations.

DeLune search cannot sort by string-type fields, but it can sort by int/bit/coord types and by TFIDF ranking.
DeLune classification does not aim for accuracy but for real-time (within 1 second) guessing performance, so I used relatively simple and fast classification algorithms. If you need accuracy, it is not a good fit for you.

Configure DeLune

Configuration is not necessary for indexing or learning, but it is required for searching or guessing, because DeLune allocates memory per thread for searching and classifying at initialization.

delune.configure (
  numthread,
  logger,
  io_buf_size = 4096,
  mem_limit = 256
)

numthread: number of threads that access DeLune collections and models. If set to 8, you can open multiple collections (or models) and access them with 8 threads. If a 9th thread tries to access DeLune, an error is raised
logger: see next chapter
io_buf_size = 4096: size in bytes of the flash buffer for replicating index files
mem_limit = 256: memory limit per thread, but it is not absolute. It can be exceeded during calculation if needed, and the memory is returned as soon as possible once the calculation has finished.

Finally, when your app terminates, call shutdown.

delune.shutdown ()
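
One way to make sure shutdown is always called is a small context manager. This is only a sketch: `engine_session` is a hypothetical helper, and it takes the engine as a parameter so that the example runs without DeLune installed. The `FakeEngine` class is a stand-in that merely records calls.

```python
from contextlib import contextmanager


@contextmanager
def engine_session(engine, numthread, logger=None):
    """Call engine.configure() on entry and engine.shutdown() on exit."""
    engine.configure(numthread, logger)
    try:
        yield engine
    finally:
        # runs even when the body raises
        engine.shutdown()


class FakeEngine:
    """Stand-in that records calls; not a DeLune class."""

    def __init__(self):
        self.calls = []

    def configure(self, numthread, logger):
        self.calls.append(("configure", numthread))

    def shutdown(self):
        self.calls.append(("shutdown",))


fake = FakeEngine()
try:
    with engine_session(fake, numthread=8):
        raise ValueError("simulated failure while searching")
except ValueError:
    pass

print(fake.calls)
```

With the real module, the same helper would presumably be used as `with engine_session(delune, 8, logger): ...`, assuming configure accepts numthread and logger positionally as shown above.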

Logger

from delune.lib import logger

logger.screen_logger ()

# it will create the file '/var/log/delune.log', rotated on a daily basis
logger.rotate_logger ("/var/log", "delune", "daily")

Standard Analyzer

An analyzer is needed for the TEXT and TERM field types.

Basic usage is:

analyzer = delune.standard_analyzer (
  max_term = 8,
  numthread = 1,
  ngram = True or False,
  stem_level = 0, 1 or 2 (2 is only applied to English Language),
  make_lower_case = True or False,
  stopwords_case_sensitive = True or False,
  ngram_no_space = True or False,
  strip_html = True or False,
  contains_alpha_only = True or False,
  stopwords = [word,...]
)

stem_level: 0 or 1; only the ‘en’ language has level 2, for hard stemming
make_lower_case: convert all text to lower case
stopwords_case_sensitive: takes effect only if make_lower_case is False
ngram_no_space: if False, ‘泣斬馬謖’ will be tokenized to _泣, 泣斬, 斬_, _馬, 馬謖, 謖_. But if True, an additional bi-gram 斬馬 will be created between 斬_ and _馬.
strip_html
contains_alpha_only: remove terms that do not contain any alphabetic character; this option is useful for full-text training in some cases
stopwords: DeLune ships only an English stopword list; you can supply custom stopwords instead. Stopwords should be unicode or utf8 encoded bytes
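
The effect of ngram_no_space can be illustrated in pure Python. The sketch below is not DeLune’s tokenizer: it assumes that the input contains a space between 泣斬 and 馬謖 (the token list above implies such a boundary), and the function name `bigram_tokens` is hypothetical.

```python
def bigram_tokens(text, ngram_no_space=False):
    """Tokenize whitespace-separated chunks into boundary-marked bi-grams."""
    chunks = text.split()
    tokens = []
    for i, chunk in enumerate(chunks):
        if ngram_no_space and i > 0:
            # bridge the gap: last char of previous chunk + first char of this one
            tokens.append(chunks[i - 1][-1] + chunk[0])
        tokens.append("_" + chunk[0])  # leading boundary marker
        tokens.extend(chunk[j:j + 2] for j in range(len(chunk) - 1))
        tokens.append(chunk[-1] + "_")  # trailing boundary marker
    return tokens


print(bigram_tokens("泣斬 馬謖"))
print(bigram_tokens("泣斬 馬謖", ngram_no_space=True))
```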

DeLune has several stemmers and n-gram methods for international languages, which can be used this way:

analyzer = delune.standard_analyzer (ngram = True, stem_level = 1)
col = delune.collection ("./col", delune.CREATE, analyzer)
indexer = col.get_indexer ()
document.field ("default", song, delune.TEXT, lang = "en")

Implemented Stemmers

Except for the English stemmer, all stemmers can be obtained from IR Multilingual Resources at UniNE.

ar: Arabic

de: German

en: English

es: Spanish

fi: Finnish

fr: French

hu: Hungarian

it: Italian

pt: Portuguese

sv: Swedish

Bi-Gram Index

If ngram is set to True, these languages will be indexed with bi-grams.

cn: Chinese

ja: Japanese

ko: Korean

Also note that if a word contains only alphabetic characters, the English stemmer is used.

Tri-Gram Index

For the other languages, the English stemmer is used if a word is spelled entirely with alphabetic characters. If ngram is set to True, words containing multibyte characters are indexed with tri-grams.

Methods Spec

analyzer.index (document, lang)

analyzer.freq (document, lang)

analyzer.stem (document, lang)

analyzer.count_stopwords (document, lang)

Collection

Collection manages index files, segments and properties.

col = delune.collection (
  indexdir = [dirs],
  mode = [ CREATE | READ | APPEND ],
  analyzer = None,
  logger = None
)

indexdir: a path, or a list of paths for using multiple disks efficiently
mode
analyzer
logger: not necessary if a logger was configured by delune.configure

A collection has 2 major classes: indexer and searcher.

Indexer

To search documents, the text must first be indexed to build an inverted index for fast term queries.

indexer = col.get_indexer (
  max_segments = int,
  force_merge = True or False,
  max_memory = 10000000 (10Mb),
  optimize = True or False
)

max_segments: maximum number of index segments; if exceeded, segments will be merged. Also note that during indexing up to 3 times max_segments segments may be created, and when indexer.close () is called, DeLune automatically tries to merge them until the number of segments is appropriate
force_merge: when indexer.close () is called, forcibly try to merge into a single segment. This fails if the index is too big: > 2 GB on a 32-bit OS, > 10 GB on 64-bit
max_memory: if exceeded, a new segment is created during indexing
optimize: when indexer.close () is called, segments will be merged into as optimal a number as possible

To add a document to the indexer, create a document object:

document = delune.document ()

DeLune handles the following 3 objects as completely separate objects with no relationship between them:

returning content
snippet generating field
searchable fields

Returning Content

DeLune serializes returning content with pickle, so you can set any picklable object.

document.content ({"userid": "hansroh", "preference": {"notification": "email", ...}})

or

document.content ([32768, "This is sample ..."])

Snippet Generating Field

This field should be unicode/utf8 encoded bytes.

document.snippet ("This is sample...")

Searchable Fields

A document also receives searchable fields:

document.field (name, value, ftype = delune.TEXT, lang = "un", encoding = None)

document.field ("default", "violin sonata in c k.301", delune.TEXT, "en")
document.field ("composer", "wolfgang amadeus mozart", delune.TEXT, "en")
document.field ("lastname", "mozart", delune.STRING)
document.field ("birth", 1756, delune.INT16)
document.field ("genre", "01011111", delune.BIT8)
document.field ("home", "50.665629/8.048906", delune.COORD6)

name: if ‘default’, this field is searched by a plain query string; otherwise use ‘name:query_text’
value: unicode or utf8 encoded text; otherwise the encoding arg should be given
ftype: see below
encoding: give e.g. ‘iso8859-1’ if value is not unicode/utf8
lang: language code for standard_analyzer, “un” (unknown) is default

Available field types are:

TEXT: analyzable full-text, result-not-sortable

TERM: analyzable full-text, but position data is not indexed, so phrases cannot be searched; result-not-sortable

STRING: exact string match, like nation codes; result-not-sortable

LIST: comma-separated STRING, result-not-sortable

COORDn, n=4,6,8 decimal precision: comma-separated string ‘latitude,longitude’; latitude and longitude should be floats in the ranges -90 ~ 90 and -180 ~ 180. n is the precision of the coordinates: n=4 is 10m radius precision, 6 is 1m and 8 is 10cm. result-sortable

BITn, n=8,16,24,32,40,48,56,64: bitwise operation, requires a bit-marked string of length n; result-sortable

INTn, n=8,16,24,32,40,48,56,64: range, int required, result-sortable
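
A bit-marked string for a BIT8 field can be built from named flags. The sketch below reuses the genre order from the Quick Start comment (rock serenade jazz piano symphony opera quartet sonata); the helper `bit_string` and the `GENRES` tuple are hypothetical names, not part of DeLune.

```python
# flag order taken from the Quick Start comment
GENRES = ("rock", "serenade", "jazz", "piano",
          "symphony", "opera", "quartet", "sonata")


def bit_string(selected, names=GENRES):
    """Return a string with '1' at each position whose flag is selected."""
    unknown = set(selected) - set(names)
    if unknown:
        raise ValueError("unknown flags: %s" % sorted(unknown))
    return "".join("1" if name in selected else "0" for name in names)


genre = bit_string({"serenade", "piano", "symphony", "opera", "quartet", "sonata"})
print(genre)  # 01011111
```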

Repeat add_document as needed, then close the indexer.

for ...:
  document = delune.document ()
  ...
  indexer.add_document (document)
indexer.close ()

If searchers using this collection are running in another process or thread, they are automatically reloaded within a few seconds to apply the changed index.

Searcher

To run a searcher, you should call delune.configure () first and then create the searcher.

searcher = col.get_searcher (
  max_result = 2000,
  num_query_cache = 200
)

max_result: maximum number of search results returned. Default is 2000; if set to 0, results are unlimited
num_query_cache: default is 200; when exceeded, the entries with the oldest access time are removed

A query is simple:

searcher.query (
  qs,
  offset = 0,
  fetch = 10,
  sort = "tfidf",
  summary = 30,
  lang = "un"
)

qs: string (unicode) or utf8 encoded bytes. For the detailed query syntax, see below
offset: start position of the returned result records
fetch: number of records to return from offset
sort: “(+-)tfidf” or “(+-)field name”; the field should be of int/bit type. ‘-’ means descending (high score/value first) and is the default if not specified. If sort is “”, records are returned in reverse indexing order
summary: number of terms for snippet
lang: default is “un” (unknown)
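
Since offset and fetch are record positions, page-style navigation needs a small conversion. This is a sketch with a hypothetical helper `page_args`; it only computes keyword arguments and does not call DeLune.

```python
def page_args(page, per_page=10, sort="tfidf"):
    """Translate a 1-based page number into offset/fetch keyword arguments."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    return {"offset": (page - 1) * per_page, "fetch": per_page, "sort": sort}


print(page_args(1))
print(page_args(3, per_page=20, sort="-birth"))
```

The returned dictionary is meant to be unpacked into a query call, for example searcher.query (qs, **page_args (3)).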

To delete indexed documents:

searcher.delete (qs)

All matching documents will be deleted immediately. If searchers using this collection are running in another process or thread, these searchers are automatically reloaded within a few seconds.

Finally, close the searcher.

searcher.close ()

Query Syntax

violin composer:mozart birth:1700~1800

search ‘violin’ in the default field, ‘mozart’ in the composer field, and the range from 1700 to 1800 in the birth field

violin allcomposer:wolfgang mozart

search ‘violin’ in the default field; all terms after allcomposer are searched in the composer field

violin -sonata birth:~1800

exclude documents that contain ‘sonata’ in the default field

violin -composer:mozart

exclude documents that contain ‘mozart’ in the composer field

violin or piano genre:00001101/all

match documents whose 5th, 6th and 8th bits are all 1; /any and /none are also available

violin or ((piano composer:mozart) genre:00001101/any)

supports unlimited nesting of ‘()’ and ‘or’ operators

(violin or ((allcomposer:mozart wolfgang) -amadeus)) sonata (genre:00001101/none home:50.6656,8.0489~10000)

search home location coordinates (50.6656, 8.0489) within 10 km

"violin sonata" genre:00001101/none home:50.6656/8.0489~10

search the exact phrase "violin sonata"

"violin^3 piano" -composer:"ludwig van beethoven"

search the loose phrase "violin piano" within 3 terms
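
The /all, /any and /none suffixes of a bit query can be read as plain bitmask tests. The sketch below shows my reading of these semantics in pure Python; it is an assumption about the behavior, not DeLune’s implementation, and `bits_match` is a hypothetical name.

```python
def bits_match(value, mask, mode="all"):
    """Test a bit-marked field value against a query mask."""
    if len(value) != len(mask):
        raise ValueError("value and mask must have the same length")
    v, m = int(value, 2), int(mask, 2)
    if mode == "all":   # every masked bit is set
        return v & m == m
    if mode == "any":   # at least one masked bit is set
        return v & m != 0
    if mode == "none":  # no masked bit is set
        return v & m == 0
    raise ValueError("mode must be 'all', 'any' or 'none'")


# genre value from the Quick Start against the mask used in the queries
print(bits_match("01011111", "00001101", "all"))
print(bits_match("01011000", "00001101", "any"))
print(bits_match("01010000", "00001101", "none"))
```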

Model

Model manages index, train files, segments and properties.

mdl = delune.model (
  indexdir = [dirs],
  mode = [ CREATE | READ | MODIFY | APPEND ],
  analyzer = None,
  logger = None
)

Learner

Building a model in DeLune takes 3 steps.

Step I. Index documents to learn
Step II. Build Corpus
Step III. Select features and save the trained model

Step I. Index documents

The learner uses delune.labeled_document, not delune.document. You can also add additional searchable fields if needed. The label is the name of the category.

learner = mdl.get_learner ()
for label, document in trainset:

  labeled_document = delune.labeled_document (label, document)
  # additional searchable fields if you need them
  labeled_document.field (name, value, ftype = TEXT, lang = "un", encoding = None)
  learner.add_document (labeled_document)

learner.close ()

Step II. Building Corpus

Document Frequency (DF) is one of the major factors for a classifier. Low-DF terms are important for searching but not for classification. One of the important parts of learning is selecting valuable terms, but terms with very low DF are not very helpful for classifying a new document, because they also have a low probability of appearing in the new document.

So for efficient learning and classification, it is useful to eliminate terms whose DF is too low or too high. For example, let’s assume you index 30,000 web pages for learning and there are about 100,000 terms. If you build the corpus with all terms, learning takes a very long time. But if you remove terms with DF < 10 or DF > 7000, 75% - 80% of all terms will be removed.

# reopen model with MODIFY
mdl = delune.model (indexdir, delune.MODIFY)
learner = mdl.get_learner ()

# show terms ordered by DF for examination
learner.listbydf (dfmin = 10, dfmax = 7000)

# build corpus and save
learner.build (dfmin = 10, dfmax = 7000)

As a result, the corpus is built with about 25,000 terms. Build time grows with the number of terms.
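
The DF filtering idea can be tried out on the four weather documents from the Quick Start. This is a pure-Python sketch, not DeLune code: `document_frequency` and `filter_terms` are hypothetical helpers, and treating dfmax = 0 as "no upper bound" is my assumption.

```python
from collections import Counter


def document_frequency(documents):
    """Count in how many documents each term appears."""
    df = Counter()
    for text in documents:
        df.update(set(text.split()))  # a term counts once per document
    return df


def filter_terms(df, dfmin=0, dfmax=0):
    """Keep terms with dfmin <= DF <= dfmax (dfmax = 0: no upper bound, assumed)."""
    return {term: n for term, n in df.items()
            if n >= dfmin and (dfmax == 0 or n <= dfmax)}


docs = ["cloudy windy warm", "windy sunny warm", "cold rainy", "windy rainy warm"]
df = document_frequency(docs)
kept = filter_terms(df, dfmin=2)
print(sorted(kept.items()))
```

With dfmin = 2, the terms that occur in only one document (cloudy, sunny, cold) drop out, which is the same kind of pruning that shrinks a 100,000-term vocabulary in the example above.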

Step III. Feature Selecting and Saving Model

Features are the most valuable terms for classifying new documents. It is important to understand that neither too many nor too few features give the best result. Selecting good features may be the most important part of classification.

For example, my work on classifying URLs into 2 classes showed the results below. The classifier is NAIVEBAYES, the selector is GSS and the minimum DF is 2. The train set is 20,000 documents and the test set is 2,000.

features 3,000 => 82.9% matched, 73 documents are unclassified

features 2,000 => 82.9% matched, 73 documents are unclassified

features 1,500 => 83.4% matched, 75 documents are unclassified

features 1,000 => 83.6% matched, 79 documents are unclassified

features 500 => 83.1% matched, 86 documents are unclassified

features 200 => 81.1% matched, 108 documents are unclassified

features 50 => 76.0% matched, 155 documents are unclassified

features 10 => 58.7% matched, 326 documents are unclassified

The results show that using more than 2,000 or fewer than 1,000 features leaves classification quality unchanged or degrades it. Also, for most classifiers too few features increase the unclassified ratio, but for NAIVEBAYES in particular too many features also increase the unclassified ratio because of the way it calculates.

mdl = delune.model (indexdir, delune.MODIFY)
learner = mdl.get_learner ()

learner.train (
  cl_for = [
    ALL (default) | NAIVEBAYES | FEATUREVOTE |
    TFIDF | SIMILARITY | ROCCHIO | MULTIPATH
  ],
  select = number of features if value is > 1 or ratio,
  selector = [
    CHI2 | GSS | DF | NGL | MI | TFIDF | IG | OR |
    OR4P | RS | LOR | COS | PPHI | YULE | RMI
  ],
  orderby = [SUM | MAX | AVG],
  dfmin = 0,
  dfmax = 0
)
learner.close ()

cl_for: which classifier to train for. If not specified, the selected features are used as the default for every classifier that does not have its own feature set. So train () can be called repeatedly, once for each classifier
select: number of features if the value is > 1, otherwise a ratio of all terms. Generally, no more than 7,000 features should be needed for classifying web pages or news articles into 20 classes.
selector: mathematical term-scoring algorithm for selecting features, considering the relation between term and term or between term and label, as well as DF, term frequency (TF), etc.
orderby: final scoring method; one of sum, max or average value
dfmin, dfmax: although terms were already removed by build (), additional terms can be removed here for an optimal result with a specific classifier
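
To make the selector and orderby options more concrete, here is a sketch of chi-square feature scoring on the weather documents from the Quick Start. It uses the textbook 2x2 contingency formula and may differ from DeLune’s CHI2 implementation; the helper names `chi2` and `rank_features` are hypothetical.

```python
def chi2(docs, term, label):
    """Textbook chi-square statistic for one term/label pair."""
    a = b = c = d = 0
    for doc_label, terms in docs:
        has_term, in_class = term in terms, doc_label == label
        if has_term and in_class:
            a += 1      # term present, in class
        elif has_term:
            b += 1      # term present, other class
        elif in_class:
            c += 1      # term absent, in class
        else:
            d += 1      # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def rank_features(docs, orderby=max):
    """Score each term per label, combine with orderby, best first."""
    labels = sorted({label for label, _ in docs})
    vocab = sorted(set().union(*(terms for _, terms in docs)))
    scores = {t: orderby([chi2(docs, t, label) for label in labels]) for t in vocab}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = [
    ("Play Golf", {"cloudy", "windy", "warm"}),
    ("Play Golf", {"windy", "sunny", "warm"}),
    ("Go To Bed", {"cold", "rainy"}),
    ("Go To Bed", {"windy", "rainy", "warm"}),
]
ranked = rank_features(docs)
print(ranked[0])
```

Passing `orderby=sum` or an averaging function instead of `max` corresponds to the SUM and AVG choices; keeping only the first N entries of the ranking corresponds to select.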

To remove the training data of a specific classifier:

mdl = delune.model (indexdir, delune.MODIFY)
learner = mdl.get_learner ()

learner.untrain (cl_for)
learner.close ()

Finding Best Training Options

Because data sets differ in their attributes, it is generally hard to say which options are best. For the best result you need to repeat the cycle of train () and guess () many times, and that is not an easy process.

index ()
build ()
train (initial options)
measure results with guess ()
append additional documents, build () if needed
train (other options)
measure results again with guess ()
…
find the optimal training options for your data set

To measure accuracy, your data should be split beforehand into a train set for train () and a test set for guess (), so that you can measure metrics such as precision and recall.

For example, I had a training set of 27,000 web pages and a test set of 2,700 for classifying pages as spam or not. The total number of indexed terms was 199,183; I eliminated 94% of the terms with DF < 30 or DF > 7000, leaving only 10,221 terms.

F: selected features by OR (Odds Ratio) MAX
NB: NAIVEBAYES, RO: ROCCHIO
The numbers mean: matched % excluding unclassified documents (number of unclassified documents)
- F 7,000: NB 97.2 (1,100), RO 95.4 (50)
- F 5,000: NB 97.4 (493), RO 94.8 (69)
- F 4,000: NB 96.6 (282), RO 91.6 (96)
- F 3,000: NB 93.2 (214), RO 86.2 (151)
- F 2,000: NB 89.4 (293), RO 80.1 (281)

Which would you choose? In my case, I chose F 5,000 with ROCCHIO because of its low unclassified ratio. But if speed were more important, I might choose F 3,000 with NAIVEBAYES.
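
The two numbers reported per configuration can be computed from a test set as follows. This is a sketch under the assumption that an unclassified document is represented by None; `matched_ratio` is a hypothetical helper and the sample labels are made up for illustration.

```python
def matched_ratio(predicted, expected):
    """Return (matched % excluding unclassified, number of unclassified)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    classified = [(p, e) for p, e in zip(predicted, expected) if p is not None]
    unclassified = len(predicted) - len(classified)
    if not classified:
        return 0.0, unclassified
    matched = sum(1 for p, e in classified if p == e)
    return 100.0 * matched / len(classified), unclassified


# made-up predictions; None marks a document the classifier left unclassified
predicted = ["spam", None, "ham", "spam"]
expected = ["spam", "spam", "ham", "ham"]
ratio, unclassified = matched_ratio(predicted, expected)
print("%.1f (%d)" % (ratio, unclassified))
```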

Once everything is done and you have found the optimal parameters, you can optimize the classifier model.

mdl = delune.model (indexdir, delune.MODIFY, analyzer)
learner = mdl.get_learner ()
learner.optimize ()
learner.close ()

Note that once optimize () has been called,

you cannot add additional training documents
you cannot rebuild the corpus by calling build () again
but you can still call train () at any time

The reason is that when low/high DF terms are eliminated by optimize (), the related index files are also shrunk irreversibly for performance. If you need to do either of these, you have to start again from Step I.

If you do not optimize, the SIMILARITY and ROCCHIO classifiers become inefficient (the NAIVEBAYES, TFIDF and FEATUREVOTE classifiers are NOT affected). But if you think regular retraining is more important than speed, you should not optimize.

Feature Selecting Methods

CHI2 = Chi Square Statistic

GSS = GSS Coefficient

DF = Document Frequency

CF = Category Frequency

NGL = NGL

MI = Mutual Information

TFIDF = Term Frequency - Inverse Document Frequency

IG = Information Gain

OR = Odds Ratio

OR4P = Kind of Odds Ratio(? can’t remember)

RS = Relevancy Score

LOR = Log Odds Ratio

COS = Cosine Similarity

PPHI = Pearson’s PHI

YULE = Yule

RMI = Residual Mutual Information

I personally prefer OR, IG and GSS selectors with MAX method.

Classifier

Finally,

classifier = mdl.get_classifier ()
classifier.guess (
  qs,
  lang = "un",
  cl = [
    NAIVEBAYES (Default) | FEATUREVOTE | ROCCHIO |
    TFIDF | SIMILARITY | META | MULTIPATH
  ],
  top = 0,
  cond = ""
)

classifier.cluster (
  qs,
  lang = "un"
)

classifier.close ()

qs: full text stream to classify
lang
cl: which classifier, META is default
top: how many of the highest scored results to return; default is 0, which means the highest scored result(s) only
cond: conditional document-selecting query. Some classifiers, like ROCCHIO and SIMILARITY, compute over lots of documents, so it is useful to reduce the number of documents. This only works when you have added additional searchable fields using labeled_document.field (…).

Implemented Classifiers

NAIVEBAYES: Naive Bayes Probability, default guessing

FEATUREVOTE: Feature Voting Classifier

ROCCHIO: Rocchio Classifier

TFIDF: Max TFIDF Score

SIMILARITY: Max Cosine Similarity

MULTIPATH: Experimental Multi Path Classifier; the terms of the document being classified are clustered into multiple sets by co-word frequency before guessing

META: merges multiple results guessed by the NAIVEBAYES, FEATUREVOTE and ROCCHIO classifiers and decides among them
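
As an illustration of the idea behind the SIMILARITY classifier, the sketch below picks the label of the training document with the highest cosine similarity to the query. It uses raw term counts for simplicity, so it is a simplified stand-in rather than DeLune’s implementation; `cosine` and `most_similar` are hypothetical names.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(query, labeled_docs):
    """Return the label of the training document most similar to the query."""
    return max(labeled_docs, key=lambda pair: cosine(query, pair[1]))[0]


# training documents from the Quick Start example
training = [
    ("Play Golf", "cloudy windy warm"),
    ("Play Golf", "windy sunny warm"),
    ("Go To Bed", "cold rainy"),
    ("Go To Bed", "windy rainy warm"),
]
print(most_similar("rainy cold", training))
```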

If you need speed most of all, NAIVEBAYES is a good choice. NAIVEBAYES is an old theory, but it still performs very well in both speed and accuracy if given a proper training set.

For more detail on each classifier algorithm, please search the web.

Optimizing Each Classifier

To give detailed options to a classifier, you can use setopt (classifier, option name = option value, …).

classifier = mdl.get_classifier ()
classifier.setopt (delune.ROCCHIO, topdoc = 200)

The SIMILARITY and ROCCHIO classifiers basically have to compare against all indexed documents, but DeLune can compare against a selection of documents via the ‘topdoc’ option. That number of documents is selected by highest TFIDF score, for classification performance reasons. The default topdoc value is 100. If you set it to 0, DeLune compares against all documents that have at least one of the features. In my experience, though, there is no critical difference except in speed.

Currently available options are:

ALL
- verbose = False
ROCCHIO
- topdoc = 100
MULTIPATH
- subcl = [ FEATUREVOTE (default) | NAIVEBAYES | ROCCHIO ]
- scoreby = [ IG (default) | MI | OR | R ]
- choiceby = [ AVG (default) | MIN ], which value to use when scoring a term against each term in a cluster
- threshold = 1.0, float threshold for creating a new cluster; it is measured with Information Gain, and its value range varies somewhat with the number of training documents.

Document Cluster

TODO

cluster = mdl.get_dcluster ()

Term Cluster

TODO

cluster = mdl.get_tcluster ()

Handling Multiple Searchers & Classifiers

When creating multiple searchers and classifiers, delune.task might be useful. Here’s a script named ‘config.py’:

import delune
from delune.lib import logger

def start_delune (numthreads, logger):
  delune.configure (numthreads, logger)

  analyzer = delune.standard_analyzer ()
  col = delune.collection ("./data1", delune.READ, analyzer)
  delune.assign ("data1", col.get_searcher (max_result = 2000))

  analyzer = delune.standard_analyzer (max_term = 1000, stem_level = 2)
  mdl = delune.model ("./data2", delune.READ, analyzer)
  delune.assign ("data2", mdl.get_classifier ())

The first argument of assign () is an alias for the searcher or classifier.

Once config.start_delune () has been called from any script, you can simply import delune and use it in other Python scripts.

import delune

delune.query ("data1", "mozart sonatas")
delune.guess ("data2", "mozart sonatas")

# close and resign
delune.close ("data1")
delune.resign ("data1")

At the end of your app, call delune.shutdown ():

import delune

delune.shutdown ()

API Export Using Skitai

New in version 0.12.14

You can use the RESTful API with Skitai-Saddle.

Copy the code below and save it as app.py.

import os
import delune
import skitai

if __name__ == "__main__":
  pref = skitai.pref ()
  pref.use_reloader = 1
  pref.debug = 1

  config = pref.config
  config.sched = "0/5 * * * *"
  config.local = "http://127.0.0.1:5000/v1"

  config.remote = os.environ.get ("DELUNE_ORIGIN")
  config.enable_mirror = config.remote

  config.resource_dir = skitai.joinpath ('resources')
  config.enable_index = True

  config.logpath = None
  skitai.trackers ('delune:collection')
  skitai.mount ("/v1", delune, "app", pref)
  skitai.run (
    workers = 2,
    port = 5000,
    logpath = config.logpath
  )

This app runs an indexing job every 5 minutes in the background.

If you want a read-only replica, set the origin server in your account environment:

export DELUNE_ORIGIN=http://192.168.1.200:5000/v1

All collections will be replicated from the http://192.168.1.200:5000/v1 API every 5 minutes.

Then run the app.

python app.py -v

Here’s an example of a client-side indexing script using the API.

colopt = {
  'data_dir': [
      'models/0/books',
      'models/1/books',
      'models/2/books'
  ],
  'analyzer': {
      "ngram": 0,
      "stem_level": 1,
      "strip_html": 0,
      "make_lower_case": 1
  },
  'indexer': {
      'force_merge': 0,
      'max_memory': 10000000,
      'max_segments': 10,
      'lazy_merge': (0.3, 1),
  },
  'searcher': {
    'max_result': 2000,
    'num_query_cache': 200
  }
}

import delune
import requests
session = requests.Session ()

# check current collections
r = session.get ('http://127.0.0.1:5000/v1/').json ()
if 'books' not in r ["collections"]:
  # collection does not exist, so create it
  session.post ('http://127.0.0.1:5000/v1/books', colopt)

dbc = db.connect (...)
cursor = dbc.cursor ()
cursor.execute (...)

numdoc = 0
while 1:
  row = cursor.fetchone ()
  if not row: break
  doc = delune.document (row._id)
  doc.content ({"author": row.author, "title": row.title , "abstract": row.abstract})
  doc.snippet (row.abstract)
  doc.field ('default', "%s %s" % (row.title, row.abstract), delune.TEXT, 'en')
  doc.field ('title', row.title, delune.TEXT, 'en')
  doc.field ('author', row.author, delune.STRING)
  doc.field ('isbn', row.isbn, delune.STRING)
  doc.field ('year', row.year, delune.INT16)

  session.post ('http://127.0.0.1:5000/v1/books/documents', doc.as_json ())
  numdoc += 1
  if numdoc % 1000 == 0:
      session.get ('http://127.0.0.1:5000/v1/books/commit')

cursor.close ()
dbc.close ()

All APIs are:

# add new collection with options
session.post ('http://127.0.0.1:5000/v1', colopt)
# get collection status and options
session.get ('http://127.0.0.1:5000/v1/books')
# modify collection options
session.patch ('http://127.0.0.1:5000/v1/books', colopt)
# remove collection but preserve all index files
session.delete ('http://127.0.0.1:5000/v1/books')
# remove collection with all index files
session.delete ('http://127.0.0.1:5000/v1/books?side_effect=data')
# undo remove collection with all index files
session.get ('http://127.0.0.1:5000/v1/books?side_effect=undo')

# get collection locks
session.get ('http://127.0.0.1:5000/v1/books/locks')
# create 'custom' lock
session.post ('http://127.0.0.1:5000/v1/books/locks/custom')
# delete 'custom' lock
session.delete ('http://127.0.0.1:5000/v1/books/locks/custom')

# add new document
session.post (
  'http://127.0.0.1:5000/v1/books/documents',
  doc.as_json ()
)
# modify document
session.patch (
  'http://127.0.0.1:5000/v1/books/documents/' + row._id,
  doc.as_json ()
)
# delete document by document_id
session.delete ('http://127.0.0.1:5000/v1/books/documents/' + row._id)

# truncate all documents from collection
session.delete ('http://127.0.0.1:5000/v1/books/documents?truncate_confirm=books')

# search
session.get ('http://127.0.0.1:5000/v1/books/search?q=title:book')
# guess
session.get ('http://127.0.0.1:5000/v1/books/guess?q=title:book')
# delete documents by search
session.delete ('http://127.0.0.1:5000/v1/books/search?q=title:book')

# commit document queue
session.get ('http://127.0.0.1:5000/v1/books/commit')
# remove document queue
session.get ('http://127.0.0.1:5000/v1/books/rollback')

Note: DeLune does not check the uniqueness of document IDs. If you post multiple documents with the same document ID, DeLune will index all of them regardless of the ID. If you want to keep IDs unique, you SHOULD use the ‘patch’ method, NOT ‘post’.
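
A client can encode this rule in one place. The sketch below only builds the HTTP method and URL from the endpoints listed above and sends nothing; `document_request` is a hypothetical helper, and choosing PATCH whenever a document ID is known is my reading of the note, not a documented guarantee.

```python
def document_request(base_url, collection, doc_id=None):
    """Return (HTTP method, URL) for writing one document."""
    documents = "%s/%s/documents" % (base_url.rstrip("/"), collection)
    if doc_id is None:
        # no ID to protect: plain add
        return "POST", documents
    # known ID: patch so that the ID stays unique
    return "PATCH", "%s/%s" % (documents, doc_id)


print(document_request("http://127.0.0.1:5000/v1", "books"))
print(document_request("http://127.0.0.1:5000/v1/", "books", "42"))
```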

For more detail about the API, see app.py.

Change Log

DeLune

0.1

change package name from Wissen to DeLune

Wissen

0.13

fix using lock

add truncate collection API

fix updating document

change replicating way to use sticky session connection with origin server

fix file creation mode on posix

fix using lock with multiple workers

change delune.document method names

fix index queue file locking

0.12

add biword arg to standard_analyzer

change export package name from appack to package

add Skitai-Saddle app

fix analyzer.count_stopwords return value

change development status to Alpha

add delune.assign(alias, searcher/classifier) and query(alias), guess(alias)

fix threads count and memory allocation

add example for Skitai-Saddle app to manual

0.11

fix HTML strip and segment merging etc.

add MULTIPATH classifier

add learner.optimize ()

make learner.build & learner.train efficient

0.10 - change version format, remove all str*_s ()

0.9 - support Python 3.x

0.8 - change license from BSD to GPL V3

Release history

0.4b27 pre-release

Oct 2, 2022

0.4b26 pre-release

Oct 1, 2022

0.4b25 pre-release

Sep 29, 2022

0.4b23 pre-release

Aug 5, 2022

0.4b22 pre-release

Aug 5, 2022

0.4b21 pre-release

Apr 25, 2022

0.4b20 pre-release

Nov 7, 2021

0.4b18 pre-release

May 3, 2021

0.4b17 pre-release

May 3, 2021

0.4b16 pre-release

May 2, 2021

0.4b15 pre-release

May 1, 2021

0.4b14 pre-release

Sep 5, 2020

0.4b13 pre-release

Aug 16, 2020

0.4b12 pre-release

Oct 25, 2019

0.4b11 pre-release

Oct 25, 2019

0.4b10 pre-release

Jun 5, 2019

0.4b9 pre-release

Mar 24, 2019

0.4b8 pre-release

Feb 25, 2019

0.4b7 pre-release

Feb 15, 2019

0.4b5 pre-release

Jan 16, 2019

0.4b4 pre-release

Jun 3, 2018

0.4b3 pre-release

Jun 3, 2018

0.4b2 pre-release

Jun 2, 2018

0.4b1 pre-release

Jun 2, 2018

0.3.2.1

Feb 9, 2019

0.3.2

Feb 7, 2019

0.3.1.7

Feb 10, 2018

0.3.1.6

Oct 25, 2017

0.3.1.5

Oct 24, 2017

0.3.1.4

Oct 23, 2017

0.3.1.3

Sep 26, 2017

0.3.1.2

Sep 24, 2017

0.3.1.1

Sep 23, 2017

0.3.1

Sep 23, 2017

0.3.1b9 pre-release

Sep 23, 2017

0.3.1b8 pre-release

Sep 22, 2017

0.3.1b7 pre-release

Sep 19, 2017

0.3.1b6 pre-release

Sep 18, 2017

0.3.1b5 pre-release

Sep 17, 2017

0.3.1b4 pre-release

Sep 17, 2017

0.3.1b3 pre-release

Sep 16, 2017

0.3.1b2 pre-release

Sep 16, 2017

0.3.1b1 pre-release

Sep 16, 2017

0.3.0.2

Sep 15, 2017

0.3b3 pre-release

Sep 15, 2017

0.3b2 pre-release

Sep 15, 2017

0.3b1 pre-release

Sep 15, 2017

0.2.1.1b1 pre-release

Sep 14, 2017

This version

0.2.1

Sep 14, 2017

0.2

Sep 14, 2017

0.2b1 pre-release

Sep 14, 2017

0.1

Sep 13, 2017

Download files

Source Distribution

delune-0.2.1.tar.gz (1.7 MB)

Uploaded Sep 14, 2017 Source

Built Distribution

delune-0.2.1-cp35-cp35m-win_amd64.whl (2.6 MB)

Uploaded Sep 14, 2017 CPython 3.5m Windows x86-64

Hashes for delune-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`b90c883cf760d589923ba454600ff9d14747f4d071460821f8a2942f25b59e91`
MD5	`23d149bdba30fc6c446806c82b65d448`
BLAKE2b-256	`0f32805594aa74ee1e0a0451f0dd2961108bab2702621fffe0029bd44c0f95ea`

Hashes for delune-0.2.1-cp35-cp35m-win_amd64.whl
Algorithm	Hash digest
SHA256	`71525911230fb2a69f53c98a06066c0ab16a1e45a2bdd79127a2935f6456a77c`
MD5	`7431b1f39d0292051b6fbb811e0959c7`
BLAKE2b-256	`522123e7d3d6c80321d1232adeaf1b85da7aff4e0bf341509d10154556edb330`