GISMO is an NLP tool to rank and organize a corpus of documents according to a query.

A Generic Information Search… With a Mind of its Own!

Gismo stands for Generic Information Search… with a Mind of its Own.

Features

Gismo combines three main ideas:

  • TF-IDTF: a symmetric version of the TF-IDF embedding.

  • DIteration: a fast, push-based variant of the PageRank algorithm.

  • Fuzzy dendrogram: a variant of the Louvain clustering algorithm.
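
To make the second idea more concrete, here is a minimal, pure-Python sketch of a push-based PageRank computation in the spirit of DIteration. It is an illustration only, not Gismo's implementation: the function name push_pagerank, the adjacency-list input, and the "largest residual first" selection rule are assumptions made for this example.

```python
def push_pagerank(out_neighbors, alpha=0.85, tol=1e-10, max_pushes=100_000):
    """Approximate PageRank by pushing residual mass ("fluid") node by node.

    out_neighbors[i] lists the successors of node i. This sketch assumes
    that every node has at least one successor (no dangling nodes).
    """
    n = len(out_neighbors)
    fluid = [(1.0 - alpha) / n] * n  # mass that still has to be diffused
    score = [0.0] * n                # mass already settled on each node
    for _ in range(max_pushes):
        # Assumed selection rule: always process the largest residual.
        i = max(range(n), key=fluid.__getitem__)
        if fluid[i] < tol:
            break
        mass, fluid[i] = fluid[i], 0.0
        score[i] += mass
        # A fraction alpha of the settled mass flows to the successors.
        share = alpha * mass / len(out_neighbors[i])
        for j in out_neighbors[i]:
            fluid[j] += share
    return score


# Directed 3-cycle: by symmetry, every node should receive about 1/3.
ranks = push_pagerank([[1], [2], [0]])
```

Unlike the classical power iteration, which updates every node at each step, a push-based scheme only touches one node and its successors per step, so the work concentrates where residual mass remains.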

Quickstart

Install gismo:

$ pip install gismo

Import gismo in a Python project:

import gismo as gs

To get the hang of a typical Gismo workflow, you can check the Toy Example notebook. For more advanced uses, look at the other tutorials or go directly to the reference section.

Credits

Thomas Bonald, Anne Bouillard, Marc-Olivier Buob, Dohy Hong.

This package was created with Cookiecutter and the francois-durand/package_helper project template.

History

X.X.X (TODO-List)

  • Rethink distortion on both vectors normalization and IDTF/query trade-off.

  • Accelerate similarity computation (currently sklearn-based) in clustering.

0.4.X (2023-0X-XX) (tentative)

  • Context manager for FileSource (e.g. with FileSource(...) as source:)

  • Python 3.9 compatibility issues rechecked

  • Wheels

  • Minor change in test_dblp.py
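
As a rough illustration of what context-manager support means for a file-backed source, the sketch below defines a small stand-in class that opens its file on entering the with block and closes it on exit. LineSource and its one-document-per-line format are hypothetical and are not part of Gismo; the actual FileSource may work differently.

```python
import os
import tempfile


class LineSource:
    """Hypothetical read-only source: one document per line of a text file."""

    def __init__(self, path):
        self.path = path
        self._handle = None

    def __enter__(self):
        # Acquire the file handle when entering the `with` block.
        self._handle = open(self.path, encoding="utf-8")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always release the handle, even if the block raised.
        self._handle.close()
        self._handle = None
        return False  # do not swallow exceptions

    def __iter__(self):
        self._handle.seek(0)
        return (line.rstrip("\n") for line in self._handle)


# Usage on a throwaway file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "docs.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first document\nsecond document\n")
    with LineSource(path) as source:
        documents = list(source)
    handle_released = source._handle is None
```

The benefit of the pattern is that the file is closed deterministically as soon as the block ends, without an explicit close call.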

0.4.3 (2022-12-26)

  • Refresh dependencies, compatibilities, and such.

  • Gismo is tested up to Python 3.10.

  • Patch for scikit-learn API changes (ngram_range must be a tuple; get_feature_names has been renamed get_feature_names_out).

  • Updated MixInIO logic: you now save with the dump method and load with the load class method.

  • Package management now uses GitHub Actions.
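
For code that has to run against both older and newer scikit-learn releases, the renaming of get_feature_names can be absorbed by a small helper. The sketch below is one possible approach, not the patch applied in Gismo; the helper name feature_names and the two stand-in vectorizer classes are made up so that the example runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vocabulary of a fitted vectorizer as a list of strings."""
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        # Older scikit-learn releases only expose the former method name.
        getter = vectorizer.get_feature_names
    return list(getter())


class OldStyleVectorizer:
    """Stand-in exposing only the former method name."""

    def get_feature_names(self):
        return ["gizmo", "mogwai"]


class NewStyleVectorizer:
    """Stand-in exposing only the new method name."""

    def get_feature_names_out(self):
        return ("gizmo", "mogwai")


old_names = feature_names(OldStyleVectorizer())
new_names = feature_names(NewStyleVectorizer())
```

With a real vectorizer, the other change mentioned above means writing, for example, CountVectorizer(ngram_range=(1, 2)) with a tuple rather than a list.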

0.4.2 (2021-05-05)

Minor patch

  • Signature of the Gismo rank method changed to accept a query vector directly instead of a string query (useful if one wants to craft a custom query vector).

  • Original source of the Reuters 50/50 dataset was discontinued; changed to an alternate source.

  • Fix for a change in the spaCy API.

0.4.1 (2020-11-25)

Minor update.

  • DBLP API modified so you can specify the set of fields you want to retrieve.

  • Minor update in doctests.

  • Python 3.9 compatibility added.

0.4.0 (2020-07-21)

0.4 is a big update. Lots of things added, lots of things changed.

  • New API for Gismo runtime parameters (see new parameters module for details). Short version:
    • gismo = Gismo(corpus, embedding, alpha=0.85): create a gismo with damping factor set to 0.85 instead of default value.

    • gismo.parameters.alpha = 0.85: set the damping factor of the gismo to 0.85.

    • gismo.rank(query, alpha=0.85): makes a query with damping factor temporarily set to 0.85.

  • Landmarks! Half Corpus, half Gismo, the Landmarks class can simplify many analysis tasks.
    • A Landmarks object is a (small) corpus where each entry is augmented with the computation of an associated gismo query;

    • Landmarks can be used to refine the analysis around a part of your data;

    • They can be used as soft and fast classifiers.

    • Landmarks’ runtime parameters follow the same approach as for Gismo instances (cf. above).

    • See the dedicated tutorial to learn more!

  • Documentation summer cleaning.

  • query_distortion parameter (reshape subspace for clustering) is renamed distortion and is now a float instead of a bool (i.e. you can apply distortion in a non-binary way).

  • Full refactoring of get_*** and post_*** methods and objects.
    • The good news is that they are now more natural, self-describing, and unified.

    • The bad news is that there is no backward-compatibility with previous Gismo versions. Hopefully this refactoring will last for some time!

  • Gismo logo added!
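
The three ways of setting a runtime parameter introduced in 0.4.0 (at construction, on the instance, or for a single call) follow a common layered-override pattern. The sketch below shows one way such a pattern can be written; the Parameters and Engine classes, the default value of alpha, and the fact that rank simply returns the effective damping factor are all hypothetical and do not reflect Gismo's actual code.

```python
from dataclasses import dataclass, replace


@dataclass
class Parameters:
    """Hypothetical container for runtime parameters."""
    alpha: float = 0.5  # arbitrary placeholder default, for illustration only


class Engine:
    """Hypothetical stand-in for an object with layered runtime parameters."""

    def __init__(self, **overrides):
        # Construction-time values replace the defaults.
        self.parameters = Parameters(**overrides)

    def rank(self, query, **overrides):
        # Per-call values apply to a copy, so the instance is left unchanged.
        effective = replace(self.parameters, **overrides)
        return effective.alpha  # a real method would use it to rank documents


engine = Engine(alpha=0.85)                  # set at construction
engine.parameters.alpha = 0.7                # set persistently on the instance
temporary = engine.rank("query", alpha=0.2)  # overridden for this call only
persistent = engine.rank("query")            # falls back to the instance value
```

Copying the parameters for each call keeps temporary overrides from leaking into later queries.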

0.3.1 (2020-06-12)

  • New dataset: Reuters C50

  • New module: sentencizer

0.3.0 (2020-05-13)

  • dblp module: url2source function added to directly load a small dblp source in memory instead of using a FileSource approach.

  • Possibility to disable query distortion in gismo.

  • XGismo class to cross analyze embeddings.

  • Tutorials updated

0.2.5 (2020-05-11)

  • auto_k feature: if not specified, a reasonable, query-dependent number of results k is estimated.

  • covering methods added to gismo. It is now possible to use get_covering_* instead of get_ranked_* to maximize coverage and/or eliminate redundancy.

0.2.4 (2020-05-07)

  • Tutorials for ACM and DBLP added. After cleaning, there are currently 3 tutorials:
    • Toy model, to get the hang of Gismo on a tiny example,

    • ACM, to play with Gismo on a small example,

    • DBLP, to play with a large dataset.

0.2.3 (2020-05-04)

  • ACM and DBLP dataset creation added.

0.2.2 (2020-05-04)

  • Notebook tutorials added (early version)

0.2.1 (2020-05-03)

  • Actual code

  • Coverage badge

0.1.0 (2020-04-30)

  • First release on PyPI.
