Wissen Full-Text Search & Classify Engine

Copyright (c) 2015 by Hans Roh

License: GPLv3

Introduction

Wissen Search & Classify Engine is a simple search engine, written mostly in Python and C in 2008.

At that time, I wanted to study an early version of Lucene. But I don't like Java, so I studied Lupy and CLucene instead, and I also made my own search engine as an exercise.

Its file format, numeric compression algorithm, and indexing process are quite similar to Lucene's. But I got tired of reverse engineering, so the query and result-fetching parts were built from my imagination. As a result it's entirely unorthodox and possibly very inefficient.

But it's relatively simple and easy to modify, and I have been using it in some of my work.

Install

sudo pip install wissen

Quick Start

Full Text Index and Search

import wissen

# indexing
analyzer = wissen.standard_analyzer (max_term = 3000)
col = wissen.collection ("./col", wissen.CREATE, analyzer)
indexer = col.get_indexer ()

song = u"violin sonata in c k.301"
composer = u"wolfgang amadeus mozart"
birth = 1756
home = u"50.665629/8.048906" # Latitude / Longitude of Salzburg
genre = u"01011111" # (rock serenade jazz piano symphony opera quartet sonata)

document = wissen.document ()
document.set_content ([song, composer])
document.set_auto_snippet (song)

document.add_field ("default", song, wissen.TEXT)
document.add_field ("composer", composer, wissen.TEXT)
document.add_field ("birth", birth, wissen.INT16)
document.add_field ("genre", genre, wissen.BIT8)
document.add_field ("home", home, wissen.COORD)

indexer.add_document (document)
indexer.close ()

# searching
analyzer = wissen.standard_analyzer (max_term = 8)
col = wissen.collection ("./col", wissen.READ, analyzer)
searcher = col.get_searcher ()
print (searcher.query (u'violin', offset = 0, fetch = 2, sort = "tfidf", summary = 30))
searcher.close ()

The result will look like this:

{
    'code': 200,
    'time': 0,
    'total': 1,
    'result': [
        [
            [u'violin sonata in c k.301', u'wolfgang amadeus mozart'], # content
            '<b>violin</b> sonata in c k.301', # auto snippet
            14, 0, 0, 0 # additional info
        ]
    ],
    'sorted': [None, 0],
    'regex': 'violin|violins',
}
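
The response is a plain dictionary, so it can be unpacked without further help from the library. The sketch below assumes every entry of 'result' has the layout shown above (content list, auto snippet, then four numeric values). summarize_hits is a hypothetical helper name, not part of wissen, and the sample dictionary is hand-written rather than real engine output.

```python
def summarize_hits(response):
    """Return (content, snippet) pairs from a search response dict.

    Assumes the layout shown above: each hit is
    [content_list, snippet, extra1, extra2, extra3, extra4].
    """
    if response.get("code") != 200:
        raise RuntimeError("search failed with code %r" % response.get("code"))
    hits = []
    for row in response.get("result", []):
        content, snippet = row[0], row[1]
        hits.append((content, snippet))
    return hits


# A response shaped like the example above (hand-written, not engine output).
sample = {
    "code": 200,
    "time": 0,
    "total": 1,
    "result": [
        [["violin sonata in c k.301", "wolfgang amadeus mozart"],
         "<b>violin</b> sonata in c k.301", 14, 0, 0, 0],
    ],
    "sorted": [None, 0],
    "regex": "violin|violins",
}

for content, snippet in summarize_hits(sample):
    print(content[0], "->", snippet)
```

Checking 'code' first keeps a failed query from being mistaken for an empty result.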

Full Text Classification

import wissen

# learning
mdl = wissen.model ("./mdl", wissen.CREATE)
learner = mdl.get_learner ()

document = wissen.labeled_document ("Play Golf", "cloudy windy warm")
learner.add_document (document)
document = wissen.labeled_document ("Play Golf", "windy sunny warm")
learner.add_document (document)
document = wissen.labeled_document ("Go To Bed", "cold rainy")
learner.add_document (document)
document = wissen.labeled_document ("Go To Bed", "windy rainy warm")
learner.add_document (document)

learner.build (min_df = 0) # build corpus
learner.train (wissen.ALL, prune_df_max = 100, selector = wissen.CHI2, select_way = wissen.MAX, select_ratio = 0.99)

learner.close ()


# guessing

mdl = wissen.model ("./mdl")
classifier = mdl.get_classifier ()
print (classifier.guess ("rainy cold"))
print (classifier.guess ("rainy cold", wissen.FEATUREVOTE))
print (classifier.guess ("rainy cold", wissen.NAIVEBAYES))
print (classifier.guess ("rainy cold", wissen.TFIDF))
print (classifier.guess ("rainy cold", wissen.SIMILARITY))
classifier.close ()

The result will look like this:

{
    'code': 200,
    'total': 1,
    'time': 5,
    'result': [
            ('Go To Bed', 1.0)
    ]
}

Searchable Field Types

  • TEXT: analyzable full-text

  • TERM: analyzable full-text but position data will not be indexed

  • STRING: exact string match, such as nation codes

  • LIST: comma-separated STRING values

  • COORDn, n=4,6,8 decimal precision: latitude, longitude, result-sortable

  • BITn, n=8,16,24,32,40,48,56,64: bitwise operation

  • INTn, n=8,16,24,32,40,48,56,64: range, result-sortable

For more information, see wissen/__init__.py
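
A BITn value is written as a string of 0 and 1 characters, as with the genre field in the Quick Start. The sketch below builds and reads such a string, assuming the characters map left to right onto the names in the Quick Start comment. encode_flags, decode_flags and GENRES are hypothetical names used for illustration, not part of wissen.

```python
# Order taken from the Quick Start comment; the left-to-right mapping
# is an assumption.
GENRES = ["rock", "serenade", "jazz", "piano",
          "symphony", "opera", "quartet", "sonata"]


def encode_flags(active, names=GENRES):
    """Build a '0'/'1' string with one character per name, in list order."""
    unknown = set(active) - set(names)
    if unknown:
        raise ValueError("unknown flags: %s" % sorted(unknown))
    return "".join("1" if name in active else "0" for name in names)


def decode_flags(bits, names=GENRES):
    """Return the names whose position holds a '1'."""
    if len(bits) != len(names):
        raise ValueError("expected %d bits, got %d" % (len(names), len(bits)))
    return [name for name, bit in zip(names, bits) if bit == "1"]


genre = encode_flags({"serenade", "piano", "symphony",
                      "opera", "quartet", "sonata"})
print(genre)
print(decode_flags(genre))
```

With these six genres set, the encoded string matches the genre value used in the Quick Start document.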

Stemming & N-Gram For International Languages

Wissen provides stemmers and n-gram methods for international languages. Use them like this:

analyzer = wissen.standard_analyzer (ngram = True, stem_level = 1)
col = wissen.collection ("./col", wissen.CREATE, analyzer)
indexer = col.get_indexer ()
document.add_field ("default", song, wissen.TEXT, lang = "en")

The default strategy of standard_analyzer is (ngram = True, stem_level = 1):

  • Step 1: index to bigram for CJK (Chinese, Japanese, Korean)

  • Step 2: stemming text by lang parameter if lang has stemmer

  • Step 3: index to tri-gram for the other languages
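
To make the bi-gram and tri-gram steps concrete, here is a minimal sketch of overlapping character n-grams. It is not wissen's tokenizer, which lives in the C and Python sources and may treat spaces and punctuation differently; char_ngrams is a hypothetical name.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text.

    Text no longer than n is returned as a single gram.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams(u"東京都", 2))   # bi-grams, as in step 1
print(char_ngrams(u"sonata", 3))   # tri-grams, as in step 3
```

Overlapping grams let a query match inside text that has no word boundaries, at the cost of a larger index.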

Automatic Bi-Gram

If ngram is set to True, these languages will be indexed as bi-grams.

  • cn: Chinese

  • jp: Japanese

  • ko: Korean

Implemented Stemmers

Except for the English stemmer, all stemmers can be obtained from IR Multilingual Resources at UniNE.

  • ar: Arabic

  • de: German

  • en: English

  • es: Spanish

  • fi: Finnish

  • fr: French

  • hu: Hungarian

  • it: Italian

  • pt: Portuguese

  • sv: Swedish

Query Syntax

  • violin composer:mozart birth:1700~1800

  • violin allcomposer:wolfgang mozart

  • violin -sonata birth:~1800

  • violin -composer:mozart

  • violin or piano genre:00001101/all

  • violin or ((piano composer:mozart) genre:00001101/any)

  • (violin or ((allcomposer:mozart wolfgang) -amadeus)) sonata (genre:00001101~none home:50.6656,8.0489~10)

  • "violin sonata" genre:00001101/none

  • "violin^3 piano" -composer:"ludwig van beethoven"

  • "violin sonata" genre:00001101/none home:50.6656/8.0489~10 # within 10M from (50.6656, 8.0489)
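
Queries are plain strings, so they can be assembled from application data. The sketch below covers only the simplest forms from the examples above: bare terms, -term exclusion, field:value, and field:low~high ranges. build_query is a hypothetical helper, not part of wissen. An open upper bound (field:low~) does not appear in the examples, so treat that form as an assumption.

```python
def build_query(terms, fields=None, ranges=None, exclude=()):
    """Compose a query string in the style of the examples above.

    terms   -- plain search words
    fields  -- dict of field name -> value, rendered as field:value
    ranges  -- dict of field name -> (low, high); None leaves a side open
    exclude -- words to negate with a leading '-'
    """
    parts = list(terms)
    parts.extend("-%s" % word for word in exclude)
    for name, value in (fields or {}).items():
        # Quote values containing spaces, as in composer:"ludwig van beethoven".
        if " " in value:
            value = '"%s"' % value
        parts.append("%s:%s" % (name, value))
    for name, (low, high) in (ranges or {}).items():
        low = "" if low is None else low
        high = "" if high is None else high
        parts.append("%s:%s~%s" % (name, low, high))
    return " ".join(parts)


print(build_query(["violin"], fields={"composer": "mozart"},
                  ranges={"birth": (1700, 1800)}))
print(build_query(["violin"], exclude=["sonata"],
                  ranges={"birth": (None, 1800)}))
```

The two calls reproduce the first and third example queries in the list.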

Full-Text Classifiers

  • META: the default; merges the results of the classifiers below

  • NAIVEBAYES

  • FEATUREVOTE

  • TFIDF

  • SIMILARITY

For more information, see wissen/classifier/classifiers/*.py
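
For readers unfamiliar with the idea behind NAIVEBAYES, the following is a textbook multinomial naive Bayes sketch with Laplace smoothing, trained on the four Quick Start documents. It is not wissen's implementation; feature selection, pruning and the exact scoring are ignored, so the scores will differ from what classifier.guess returns. The train and guess functions here are hypothetical stand-ins.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count label frequencies and per-label term frequencies."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in samples:
        terms = text.split()
        label_counts[label] += 1
        term_counts[label].update(terms)
        vocabulary.update(terms)
    return label_counts, term_counts, vocabulary


def guess(model, text):
    """Return (label, probability) pairs, best first."""
    label_counts, term_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / float(total_docs))  # prior
        denom = sum(term_counts[label].values()) + len(vocabulary)
        for term in text.split():
            if term in vocabulary:  # ignore terms never seen in training
                # Laplace (add-one) smoothed term likelihood.
                score += math.log((term_counts[label][term] + 1.0) / denom)
        log_scores[label] = score
    # Normalize log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    ranked = [(label, w / norm) for label, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


samples = [
    ("Play Golf", "cloudy windy warm"),
    ("Play Golf", "windy sunny warm"),
    ("Go To Bed", "cold rainy"),
    ("Go To Bed", "windy rainy warm"),
]
model = train(samples)
print(guess(model, "rainy cold"))
```

Both "rainy" and "cold" occur only under "Go To Bed" in the training data, so that label ranks first.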

Note for Multi-threading and Multiple Collection

Indexing and learning support only a single thread.

Searching and guessing are thread-safe only if you use a thread pool.

If you create 8 threads for searching, configure wissen with the same number:

wissen.configure (numthread = 8)

Now you can open multiple collections (or models) and access them with 8 threads.

If a 9th thread tries to access wissen, it will raise an error.
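
One way to stay within the configured limit is a fixed-size pool from the standard library. The sketch below uses concurrent.futures with 8 workers. fake_search is a stand-in so the example runs without an index; in real use it would wrap a call such as searcher.query. Whether the main thread counts against numthread is not stated here, so treat the pool size as an assumption to verify.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 8  # keep equal to the value passed to wissen.configure


def run_queries(search, queries, num_threads=NUM_THREADS):
    """Run search(query) for every query on a fixed-size pool of workers.

    The pool never starts more than num_threads threads, so the number of
    worker threads touching the engine stays within the configured limit.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in the same order as the input queries.
        return list(pool.map(search, queries))


def fake_search(query):
    # Stand-in for a real search call; an assumption so that the sketch
    # runs without an index on disk.
    return {"code": 200, "query": query}


results = run_queries(fake_search, ["violin", "piano", "sonata"])
print([r["query"] for r in results])
```

Creating one thread per request instead would eventually exceed the limit and trigger the error described above.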

Core Class & Function Prototypes

# logger
from wissen.lib import logger
logger.screen_logger ()
logger.rotate_logger ("/var/log/wissen")

# for a multi-threading environment, initialize wissen
wissen.configure (numthread, logger, io_buf_size = 8192, mem_limit = 256, max_segment_size = 0)

# finally,
wissen.shutdown ()

wissen.standard_analyzer (max_term = 8, numthread = 1, **karg)

karg may include:
  ngram = True or False
  stem_level = 1 or 2 (2 applies only to English)
  stopwords_case_sensitive = True or False
  ngram_no_space = True or False
  strip_html = True or False

col = wissen.collection (indexdir, mode = wissen.READ, analyzer = None, logger = None)
col.setopt (key = value, ...)

keys and default values:
  merge_factor = int
  force_merge = True or False
  max_memory = 10000000 (10 MB)
  optimize = True or False
  max_result = 2000
  num_query_cache = 200

mdl = wissen.model (indexdir, mode = wissen.READ, analyzer = None, logger = None)
mdl.setopt (key = value, ...)

keys and default values:
  use_features_top = 0 # use all features

Documentation

Not yet.

Change Log

0.10 - change version format, remove all str*_s ()

0.9.0.5 - fix long long int, bit type

0.9.0.3 - fix logger encoding

0.9.0.2 - fix snippet-making

0.9.0.1 - support Python 3.x

0.8.0.13 - change license from BSD to GPL V3

