contextualized-topic-models

Contextualized Topic Models

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

Contextualized Topic Models

Free software: MIT license
Documentation: https://contextualized-topic-models.readthedocs.io.

Super big shout-out to Stephen Carrow for creating the awesome https://github.com/estebandito22/PyTorchAVITM package, on which we built the foundations of this package. We are happy to redistribute this software under the MIT License.

Features

Combines BERT and Neural Variational Topic Models
Two different methodologies: combined, which combines BoW and BERT embeddings, and contextual, which uses only BERT embeddings
Includes methods to create embedded representations and BoW
Includes evaluation metrics
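
The difference between the two methodologies can be pictured with a small sketch. It is illustrative only: plain Python lists stand in for the model's tensors, the function name build_encoder_input is hypothetical, and concatenation is assumed here as the way the BoW and the embedding are combined.

```python
# Illustrative sketch, not the package's code: plain lists stand in for tensors.

def build_encoder_input(bow, contextual_embedding, inference_type):
    """Return the vector given to the inference network for one document."""
    if inference_type == "combined":
        # Assumed combination: BoW counts followed by the contextual embedding.
        return list(bow) + list(contextual_embedding)
    if inference_type == "contextual":
        # Only the contextual embedding is used; the BoW is left out of the input.
        return list(contextual_embedding)
    raise ValueError(f"unknown inference_type: {inference_type!r}")


bow = [2, 0, 1, 0]            # toy BoW over a 4-word vocabulary
embedding = [0.1, -0.3, 0.7]  # toy 3-dimensional document embedding

combined_input = build_encoder_input(bow, embedding, "combined")
contextual_input = build_encoder_input(bow, embedding, "contextual")
print(len(combined_input), len(contextual_input))  # prints: 7 3
```

In both cases the model still reconstructs the BoW; only the input to the inference network changes.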

Quick Guide

Install the package using pip

pip install -U contextualized_topic_models

The contextual neural topic model can be instantiated with only a few parameters, although a wide range of additional parameters is available to change its behaviour. When you generate embeddings with BERT, remember that there is a maximum input length: for documents that are too long, some words will be ignored.

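from contextualized_topic_models.models.cotm import COTM
# The names imported below are the ones this example actually calls.
# Their module paths are assumed; check them against your installed version.
from contextualized_topic_models.utils.data_preparation import TextHandler
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
from contextualized_topic_models.datasets.dataset import COTMDataset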

handler = TextHandler("documents.txt")
handler.prepare() # create vocabulary and training data

# generate BERT data
training_bert = bert_embeddings_from_file("documents.txt", "distiluse-base-multilingual-cased")

training_dataset = COTMDataset(handler.bow, training_bert, handler.idx2token)

cotm = COTM(input_size=len(handler.vocab), bert_input_size=512, inference_type="contextual", n_components=50)

cotm.fit(training_dataset) # run the model
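
To make the vocabulary and BoW inputs concrete, here is a minimal pure-Python sketch of the kind of data the preparation step yields. It is not the package's TextHandler: the function build_bow, whitespace tokenization, and the sorted vocabulary order are all assumptions made for illustration.

```python
from collections import Counter


def build_bow(documents):
    """Build a vocabulary, an index-to-token map, and one count vector per document."""
    tokenized = [doc.split() for doc in documents]  # assumed: whitespace tokenization
    vocab = sorted({token for tokens in tokenized for token in tokens})
    idx2token = {i: token for i, token in enumerate(vocab)}
    bow = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary word, in vocabulary order.
        bow.append([counts.get(token, 0) for token in vocab])
    return vocab, idx2token, bow


docs = [
    "topic models find topics",
    "bert embeds documents",
    "topic models use bert",
]
vocab, idx2token, bow = build_bow(docs)
print(len(vocab))  # vocabulary size, the value passed as input_size above
print(bow[0])      # count vector of the first document
```

Every row of bow has the same length as the vocabulary, which is why input_size is set to len(handler.vocab) in the example above.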

See the example notebook in the contextualized_topic_models/examples folder. You can also evaluate your topics using different measures, for example coherence with NPMI.

from contextualized_topic_models.evaluation.measures import CoherenceNPMI

with open('documents.txt',"r") as fr:
    texts = [doc.split() for doc in fr.read().splitlines()] # load text for NPMI

npmi = CoherenceNPMI(texts=texts, topics=cotm.get_topic_lists(10))
npmi.score()
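
For readers unfamiliar with NPMI, the sketch below computes it from document-level co-occurrence for the word pairs of one topic. It illustrates the measure only and is not the package's CoherenceNPMI, which may estimate the probabilities differently; the function names npmi_pair and topic_npmi are hypothetical.

```python
import math
from itertools import combinations


def npmi_pair(w1, w2, docs):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 1.0   # they co-occur everywhere; avoids dividing by log(1) = 0
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(topic_words, docs):
    """Average NPMI over all unordered word pairs of a topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of tokens.
toy_docs = [{"cat", "dog"}, {"cat", "dog"}, {"car", "road"}, {"car", "road"}]
print(topic_npmi(["cat", "dog"], toy_docs))  # words that always appear together
print(topic_npmi(["cat", "car"], toy_docs))  # words that never appear together
```

Values range from -1 (never together) to 1 (always together), so higher scores indicate more coherent topics.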

Predict topics for novel documents

test_handler = TextHandler("spanish_documents.txt")
test_handler.prepare() # create vocabulary and test data

# generate BERT data
testing_bert = bert_embeddings_from_file("spanish_documents.txt", "distiluse-base-multilingual-cased")

testing_dataset = COTMDataset(test_handler.bow, testing_bert, test_handler.idx2token)
cotm.get_thetas(testing_dataset)
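
A common next step is to pick the highest-weighted topic for each document. The sketch below assumes that the predicted topic distributions form one row per document with one weight per topic; that layout, the toy values, and the helper dominant_topics are assumptions, not something this page specifies.

```python
def dominant_topics(thetas):
    """Return (topic_index, weight) of the highest-weighted topic per document."""
    result = []
    for row in thetas:
        best = max(range(len(row)), key=row.__getitem__)
        result.append((best, row[best]))
    return result


# Hypothetical topic distributions for two documents over three topics.
toy_thetas = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
assignments = dominant_topics(toy_thetas)
print(assignments)  # prints: [(1, 0.7), (0, 0.6)]
```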

Team

Federico Bianchi <f.bianchi@unibocconi.it> Bocconi University
Silvia Terragni <s.terragni4@campus.unimib.it> University of Milan-Bicocca
Dirk Hovy <dirk.hovy@unibocconi.it> Bocconi University

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template. To ease the use of the library we have also included the rbo package; all rights are reserved to the author of that package.

History

1.0.0 (2020-04-05)

Released models with the main features implemented

0.1.0 (2020-04-04)

First release on PyPI.

Release history

2.5.0 (Mar 2, 2023)
2.4.2 (Nov 3, 2022)
2.4.1 (Nov 3, 2022)
2.4.0 (Oct 14, 2022)
2.3.0 (May 7, 2022)
2.2.1 (Nov 9, 2021)
2.2.0 (Sep 20, 2021)
2.1.2 (Sep 3, 2021)
2.1.1 (Jul 19, 2021)
2.0.1 (May 25, 2021)
2.0.0 (May 25, 2021)
1.8.2 (Feb 8, 2021)
1.8.1 (Jan 11, 2021)
1.8.0 (Jan 11, 2021)
1.7.1 (Dec 17, 2020)
1.7.0 (Dec 10, 2020)
1.6.0 (Nov 10, 2020)
1.5.3 (Nov 3, 2020)
1.5.2 (Nov 3, 2020)
1.5.0 (Sep 14, 2020)
1.4.3 (Sep 3, 2020)
1.4.2 (Aug 16, 2020)
1.4.1 (Aug 4, 2020)
1.4.0 (Aug 1, 2020)
1.3.3 (Jul 19, 2020)
1.3.1 (Apr 17, 2020)
1.0.1 (Apr 8, 2020), this version
1.0.0 (Apr 5, 2020)
0.4.2 (Apr 4, 2020)
0.1.0 (Apr 4, 2020)

Download files

Source Distribution

contextualized_topic_models-1.0.1.tar.gz (23.6 kB), uploaded Apr 8, 2020

Built Distribution

contextualized_topic_models-1.0.1-py2.py3-none-any.whl (19.8 kB), uploaded Apr 8, 2020, Python 2 and Python 3

Hashes for contextualized_topic_models-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`10a0ab4d4bbb49e5948ebbe69b7a746ae22e1df3cf27ea9598f6db3a32612c0a`
MD5	`1a04886e2e5669c30f955e8d2f508188`
BLAKE2b-256	`6ec45bc89491e7bca5625f9a3adf47504666e597e50d0ed80920730a812b0509`

Hashes for contextualized_topic_models-1.0.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`c5ef6d0d4fcb55a46e1ecbec0c6b0f81114e7e66c6e4ab9aba4104f8a53f60bd`
MD5	`c98ebb58a88805b411e59e26acd41b0a`
BLAKE2b-256	`d010d7519b918144151b6835120cf19f48ef468bf8a8ee28f49766d4bcbfd8f6`
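
The digests above can be used to check a downloaded file. The following is a minimal sketch using only the Python standard library; the helper names sha256_of_file and matches_published_hash are hypothetical, and the file is assumed to have been downloaded to the path you pass in.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, calling matches_published_hash with the path of the downloaded contextualized_topic_models-1.0.1.tar.gz and the SHA256 value listed for it should return True for an intact download.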