Topic Modeling using Transformers and advanced visualization

Bunkatopics

Bunkatopics is a topic modeling and visualization method that leverages Transformers from HuggingFace through langchain. It is built with the same philosophy as BERTopic but goes deeper in the visualization, to help users quickly and intuitively grasp the content of thousands of texts. It also allows a supervised visual representation by letting the user create continuums with natural language.

Installation

First, create a new virtual environment using pyenv:

pyenv virtualenv 3.9 bunkatopics_env

Activate the environment:

pyenv activate bunkatopics_env

Then install the Bunkatopics package:

pip install bunkatopics

Install the spaCy tokenizer model for English:

python -m spacy download en_core_web_sm
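
After installing, a quick way to confirm that everything is in place is to check that each package can be found for import. The sketch below uses only the standard library. The helper name missing_packages is hypothetical, and the check assumes that spaCy models such as en_core_web_sm are installed as regular Python packages.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level package names installed by the commands above.
required = ["bunkatopics", "spacy", "en_core_web_sm"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, make sure the virtual environment is activated and rerun the corresponding install command.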

Getting Started

Name Link
Visual Topic Modeling With Bunkatopics Open In Colab

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

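from bunkatopics import Bunka
from sklearn.datasets import fetch_20newsgroups
import random

# Load the raw posts, stripped of headers, footers and quoted replies
full_docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Optional: a random sample of 1,000 documents for faster experiments
full_docs_random = random.sample(full_docs, 1000)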
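
Note that random.sample draws a different subset on every run. If you want the same documents each time, for example to compare two embedding models on identical input, a seeded generator helps. This is a minimal sketch with a placeholder corpus; sample_docs and the seed value are illustrative and not part of Bunkatopics.

```python
import random


def sample_docs(docs, k, seed=42):
    """Return up to k documents drawn without replacement, reproducibly."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    return rng.sample(docs, min(k, len(docs)))


# Placeholder corpus standing in for the 20 newsgroups documents.
docs = [f"document {i}" for i in range(5000)]
subset_a = sample_docs(docs, 1000)
subset_b = sample_docs(docs, 1000)

# Same seed, same input: both calls return the same documents in the same order.
print(len(subset_a), subset_a == subset_b)
```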

You can then load any model from langchain. Some of them may be large, so please check the langchain documentation.

If you want to start with a small model:

from langchain.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

bunka = Bunka(model_hf=embedding_model)

bunka.fit(full_docs)
df_topics = bunka.get_topics(n_clusters=20)

If you want a larger embedding model:

from langchain.embeddings import HuggingFaceInstructEmbeddings

embedding_model = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")

bunka = Bunka(model_hf=embedding_model)

bunka.fit(full_docs)
df_topics = bunka.get_topics(n_clusters=20)

Then we can visualize the topics:

topic_fig = bunka.visualize_topics(width=800, height=800)
topic_fig
...

The map displays the texts on a two-dimensional, unsupervised layout. Each region of the map is a topic, described by its most specific terms.

You can also build a map whose axes you define in natural language:

bourdieu_fig = bunka.visualize_bourdieu(x_left_words=["past"],
                                        x_right_words=["future", "futuristic"],
                                        y_top_words=["politics", "Government"],
                                        y_bottom_words=["cultural phenomenons"],
                                        height=2000,
                                        width=2000)

The power of this visualisation is that it constrains the axes by creating continuums, then shows how the data are distributed along them. The inspiration comes from the French sociologist Bourdieu, who projected items onto two-dimensional maps.
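
To make the idea of a continuum concrete, here is a minimal sketch of one way to place documents on an axis defined by two sets of words: average the embeddings of each pole, take the difference as the axis direction, and project each document onto it. The three-dimensional vectors are made up for illustration, the helper names are hypothetical, and this is not necessarily how visualize_bourdieu computes its coordinates.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def axis_direction(left_vectors, right_vectors):
    """Unit vector pointing from the left pole to the right pole."""
    left = mean_vector(left_vectors)
    right = mean_vector(right_vectors)
    diff = [r - l for l, r in zip(left, right)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("left and right poles coincide")
    return [d / norm for d in diff]


def project(vector, axis):
    """Signed position of a vector along the axis (dot product)."""
    return sum(v * a for v, a in zip(vector, axis))


# Toy embeddings (assumed values, not produced by a real model).
past = [[1.0, 0.0, 0.0]]
future = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
axis = axis_direction(past, future)

documents = {
    "history of the printing press": [0.9, 0.1, 0.0],
    "forecasts for space travel": [0.1, 0.9, 0.1],
}
for text, embedding in documents.items():
    # Negative values lean towards "past", positive towards "future".
    print(f"{project(embedding, axis):+.2f}  {text}")
```

Repeating the same projection with a second pair of word sets gives the vertical coordinate, which yields a two-dimensional map of the kind described above.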

Functionality

Here are all the things you can do with Bunkatopics

Common

Below is an overview of the common functions in Bunkatopics.

Method                                                    Code
Fit the model                                             .fit(docs)
Fit the model and get the topics                          .fit_transform(docs)
Access the topics                                         .get_topics(n_clusters=10)
Access the top documents per topic                        .get_top_documents()
Access the distribution of topics                         .get_topic_repartition()
Visualize the topics on a map                             .visualize_topics()
Visualize the topics on natural-language supervised axes  .visualize_bourdieu()
Access the coherence of topics                            .get_topic_coherence()
Get the closest documents to your search                  .search('politics')
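
As an illustration of what a semantic search such as .search('politics') involves, the sketch below ranks documents by cosine similarity between a query embedding and the document embeddings. The vectors are toy values and rank_by_similarity is a hypothetical helper; the library's own implementation may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, doc_vectors, top_k=2):
    """Return the top_k (text, score) pairs, most similar first."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values, not produced by a real model).
doc_vectors = {
    "parliament votes on the new budget": [0.9, 0.1, 0.1],
    "tips for repairing a motorcycle": [0.1, 0.9, 0.2],
    "election campaign enters final week": [0.8, 0.2, 0.3],
}
query_vector = [1.0, 0.0, 0.1]  # stands in for the embedding of "politics"

for text, score in rank_by_similarity(query_vector, doc_vectors):
    print(f"{score:.2f}  {text}")
```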

Attributes

You can access several attributes:

Attribute  Description
.docs      The documents, stored as Document pydantic models
.topics    The topics, stored as Topic pydantic models

Download files

Source Distribution: bunkatopics-0.36.tar.gz (17.1 kB)

Built Distribution: bunkatopics-0.36-py3-none-any.whl (21.0 kB, Python 3)
