
OCTIS: a library for Optimizing and Comparing Topic Models.

Project description


OCTIS (Optimizing and Comparing Topic models Is Simple) aims to train, analyze, and compare topic models, whose optimal hyperparameters are estimated by means of a Bayesian Optimization approach.

Install

You can install OCTIS with the following command:

pip install octis

You can find the requirements in the requirements.txt file.

Features

  • We provide a set of state-of-the-art preprocessed text datasets (or you can preprocess your own dataset)

  • We provide a set of well-known topic models (both classical and neural), or you can integrate your own model

  • You can evaluate your model using several state-of-the-art evaluation metrics

  • You can optimize the hyperparameters of the models with respect to a given metric using Bayesian Optimization

  • We provide a simple web dashboard for starting and controlling the optimization experiments

Get a preprocessed dataset

To acquire a dataset you can use one of the built-in sources.

from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load("octis/preprocessed_datasets/m10")

Or use your own.

import octis.preprocessing.sources.custom_dataset as source
dataset = source.retrieve("path/to/dataset")

A custom dataset must contain one document per line of the file. Datasets can be partitioned into train and test sets.
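
For illustration, a minimal sketch of producing such a file from a list of documents; the file name my_corpus.txt and the sample documents are hypothetical:

# Hypothetical example: write a corpus file with one document per line
docs = [
    "topic models discover latent themes in text collections",
    "bayesian optimization tunes hyperparameters efficiently",
]
with open("my_corpus.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(doc + "\n")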

Preprocess

To preprocess a dataset, initialize a Pipeline_handler and use the preprocess method.

from octis.preprocessing.pipeline_handler import Pipeline_handler

pipeline_handler = Pipeline_handler(dataset) # Initialize pipeline handler
preprocessed = pipeline_handler.preprocess() # preprocess

preprocessed.save("dataset_folder") # Save the preprocessed dataset

To customize the preprocessing pipeline, see the optimization demo example in the examples folder.

Train a model

To build a model, load a preprocessed dataset, customize the model hyperparameters and use the train_model() method of the model class.

from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA

# Load a dataset
dataset = Dataset()
dataset.load("dataset_folder")

model = LDA(num_topics=25)  # Create model
model_output = model.train_model(dataset) # Train the model

If the dataset is partitioned, you can choose to:

  • Train the model on the training set and test it on the test documents

  • Train the model on the training set and update it with the test set

  • Train the model with the whole dataset, regardless of any partition (see the sketch below).
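
A hedged sketch of selecting among these behaviours; the attribute names use_partitions and update_with_test are assumptions and may differ in your OCTIS version:

# Hedged sketch: attribute names are assumptions, check the model class in your OCTIS version
model = LDA(num_topics=25)
model.use_partitions = True        # train on the training set, test on the test documents
model.update_with_test = False     # set to True to also update the model with the test set
# model.use_partitions = False     # train on the whole dataset, ignoring any partition
model_output = model.train_model(dataset)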

Evaluate a model

To evaluate a model, choose a metric and use the score() method of the metric class.

from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# Set metric parameters
td_parameters = {'topk': 10}

metric = TopicDiversity(td_parameters) # Initialize metric
topic_diversity_score = metric.score(model_output) # Compute score of the metric

Optimize a model

To optimize a model, you need to select a dataset, a metric, and the search space of the hyperparameters to optimize.
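
The objective can be any OCTIS metric. As a hedged sketch, an NPMI coherence metric (used as npmi in the code below) could be built as follows; the Coherence class, its dict-based constructor, the parameter names, and dataset.get_corpus() are assumptions mirroring the TopicDiversity example above, so check the signatures in your installed version:

from octis.evaluation_metrics.coherence_metrics import Coherence

# Hedged sketch: parameter names and constructor style are assumptions
npmi_parameters = {'texts': dataset.get_corpus(), 'topk': 10, 'measure': 'c_npmi'}
npmi = Coherence(npmi_parameters)  # NPMI coherence, used as the optimization objective below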

from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real  # search-space dimensions (scikit-optimize)

# Define the search space of the hyperparameters to optimize
search_space = {
    "alpha": Real(low=0.001, high=5.0),
    "eta": Real(low=0.001, high=5.0)
}

number_of_call = 5   # number of optimization iterations
model_runs = 3       # model runs per hyperparameter configuration
save_path = "results"

# Initialize an optimizer object and start the optimization
# (npmi is an evaluation metric object, e.g. the NPMI coherence metric defined above)
optimizer = Optimizer()
OptObject = optimizer.optimize(model, dataset, npmi, search_space,
                               number_of_call=number_of_call,
                               model_runs=model_runs,
                               save_path=save_path)

# Save the results of the optimization in a csv file
OptObject.save_to_csv("results.csv")

The result provides the best-seen value of the metric with the corresponding hyperparameter configuration, as well as the hyperparameters and metric value for each iteration of the optimization. To visualize this information, set the 'plot' attribute of Bayesian_optimization to True.
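
Since the results are written to a plain CSV file, they can also be inspected with standard tools, for example with pandas; the exact column names depend on the OCTIS version:

import pandas as pd

# Load the optimization log written by save_to_csv and take a quick look
results = pd.read_csv("results.csv")
print(results.head())             # one row per optimization iteration
print(results.columns.tolist())   # exact columns depend on the OCTIS version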

You can find more here: optimizer README

Examples and Tutorials

Our Colab Tutorials:

  • How to build a topic model and evaluate the results (Open in Colab)

  • Optimizing a topic model (Example with ETM and 20Newsgroup) (Open in Colab)

  • Optimizing a topic model (Example with LDA and M10) (Open in Colab)

Available Models

  • AVITM

  • CTM

  • ETM

  • HDP

  • LDA

  • LSI

  • NMF

  • NeuralLDA

  • ProdLDA

Available Datasets

  • 20Newsgroup

  • BBC News

  • DBLP

  • M10

Disclaimer

Similarly to TensorFlow Datasets and HuggingFace’s nlp library, we only downloaded and prepared public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use them. It is your responsibility to determine whether you have permission to use a dataset under its license and to cite its rightful owner.

If you’re a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, please get in touch through a GitHub issue.

If you’re a dataset owner and wish to include your dataset in this library, please get in touch through a GitHub issue.

Implement your own Model

Models inherit from the class Abstract_Model defined in models/model.py. To build your own model, your class must override the train_model(self, dataset, hyperparameters) method, which always requires at least a Dataset object and a dictionary of hyperparameters as input, and should return a dictionary with the output of the model.

To better understand how a model works, let’s have a look at the LDA implementation. The first step in developing a custom model is to define the dictionary of default hyperparameter values:

hyperparameters = {'corpus': None, 'num_topics': 100,
    'id2word': None, 'alpha': 'symmetric',
    'eta': None, # ...
    'callbacks': None}

Defining the default hyperparameter values allows users to set only a subset of them, without having to assign a value to every parameter.
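
Inside train_model, a common pattern for this (a sketch, not the exact OCTIS code) is to overlay the user-supplied values on the defaults:

# Sketch: merge user-supplied hyperparameters over the defaults (not the exact OCTIS code)
params = dict(self.hyperparameters)   # start from the default values
params.update(hyperparameters)        # override only the keys provided by the caller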

The following step is the train_model() override:

def train_model(self, dataset, hyperparameters={}, top_words=10):

The LDA method requires a dataset, the hyperparameters dictionary, and an extra (optional) argument used to select how many of the most significant words to track for each topic.

With the hyperparameter defaults, the values passed as input, and the dataset, you should be able to write your own code and return a dictionary with at least 3 entries:

  • topics: the list of the most significant words for each topic (a list of lists of strings).

  • topic-word-matrix: an NxV matrix of weights where N is the number of topics and V is the vocabulary length.

  • topic-document-matrix: an NxD matrix of weights where N is the number of topics and D is the number of documents in the corpus.

If your model supports train/test partitioning, it should also return:

  • test-topic-document-matrix: the topic-document matrix of the test set,

in case the model is not updated with the test set. If instead the model is updated with the test set, it should return:

  • test-topics: the list of the most significant words for each topic (a list of lists of strings) of the model updated with the test set.

  • test-topic-word-matrix: an NxV matrix of weights, where N is the number of topics and V is the vocabulary length, of the model updated with the test set.

  • test-topic-document-matrix: an NxD matrix of weights, where N is the number of topics and D is the number of documents in the corpus, of the model updated with the test set.
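
Putting the contract together, here is a minimal skeleton of a custom model. It is only an illustrative sketch: the import path for Abstract_Model, the dataset.get_corpus() accessor, and the random "training" are assumptions, not OCTIS code.

from octis.models.model import Abstract_Model  # assumed import path for models/model.py
import numpy as np

class MyModel(Abstract_Model):

    def train_model(self, dataset, hyperparameters={}, top_words=10):
        # Default hyperparameter values, overridden by whatever the caller passes in
        params = {'num_topics': 10}
        params.update(hyperparameters)
        n = params['num_topics']

        corpus = dataset.get_corpus()  # assumed accessor: list of tokenized documents
        vocab = sorted({word for doc in corpus for word in doc})

        # Placeholder "training": random weights, only to illustrate the output contract
        topic_word = np.random.rand(n, len(vocab))    # N x V
        topic_doc = np.random.rand(n, len(corpus))    # N x D
        topics = [[vocab[i] for i in row.argsort()[::-1][:top_words]] for row in topic_word]

        return {'topics': topics,
                'topic-word-matrix': topic_word,
                'topic-document-matrix': topic_doc}

A class like this can then be trained and evaluated in the same way as the built-in models.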

Dashboard

OCTIS includes a user-friendly graphical interface for creating, monitoring, and viewing experiments. As long as your custom datasets, models, and metrics follow the implementation standards above, the dashboard will automatically pick them up and allow you to use them.

To run the dashboard, run the following command from the project directory:

python OCTIS/dashboard/server.py

The browser will open and you will be redirected to the dashboard. In the dashboard you can:

  • Create new experiments organized in batches

  • Visualize and compare all the experiments

  • Visualize a custom experiment

  • Manage the experiment queue

Team

Project and Development Lead

Current Contributors

Past Contributors

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
