Advanced Topic Visualization

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Bunkatopics

Bunkatopics is a package for Topic Modeling Visualisation, Frame Analysis, and Retrieval Augmented Generation (RAG) that leverages LLMs. It follows the same philosophy as BERTopic but goes further on visualization, helping users quickly and intuitively grasp the content of thousands of texts. It also lets users define their own frames.

Bunkatopics is built on top of langchain.

Installation via pip

First, create a new virtual environment using pyenv

pyenv virtualenv 3.10 bunkatopics_env

Activate the environment

pyenv activate bunkatopics_env

Then install the Bunkatopics package:

pip install bunkatopics==0.42
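To confirm which release actually ended up in the environment, the Python standard library can report the installed version. The sketch below is only an illustration; the helper name `installed_version` is hypothetical and not part of Bunkatopics.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None
```

Calling `installed_version("bunkatopics")` inside the activated environment should return the version string that pip installed, or `None` if the installation did not succeed.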

Pipeline

Installation via Git Clone

pip install poetry
git clone https://github.com/charlesdedampierre/BunkaTopics.git
cd BunkaTopics

# Create the environment from the .lock file.
poetry install # This will install all packages in the .lock file inside a virtual environment

# Start the environment
poetry shell

Colab Example

Name	Link
Visual Topic Modeling With Bunkatopics

Quick Start

Install the spaCy tokenizer model for English:

python -m spacy download en_core_web_sm

We start by loading a dataset of Trump tweets from HuggingFace datasets:

from bunkatopics.functions.clean_text import clean_tweet
import random
from datasets import load_dataset

dataset = load_dataset("rguo123/trump_tweets")["train"]["content"]
full_docs = random.sample(dataset, 5000)
full_docs = [clean_tweet(x) for x in full_docs] # Cleaning the tweets
full_docs = [x for x in full_docs if len(x)>50] # Removing small tweets, they are not informative enough
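The preparation step above boils down to "clean each text, then drop the short ones". The sketch below reproduces that pattern with the standard library only. `simple_clean` is a hypothetical stand-in and makes no claim about what `clean_tweet` actually does internally; it assumes that cleaning means removing URLs and @mentions, lowercasing, and squeezing whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def simple_clean(text):
    """Hypothetical cleaner: drop URLs and @mentions, squeeze spaces, lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip().lower()


def prepare(docs, min_chars=50):
    """Clean every document, then keep those longer than min_chars characters."""
    cleaned = [simple_clean(d) for d in docs]
    return [d for d in cleaned if len(d) > min_chars]
```

Filtering after cleaning matters: a tweet that is mostly a link can look long before cleaning and carry almost no text afterwards.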

You can then load any embedding model from langchain. Some of them are large, so check the langchain documentation before downloading one.

Topic Modeling

from bunkatopics import Bunka
from langchain.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") # We use a small model
bunka = Bunka(embedding_model=embedding_model)
bunka.fit(full_docs)

# Get the list of topics
bunka.get_topics(n_clusters = 20)

Then, we can visualize the topics computed

bunka.visualize_topics(width=800, height=800)
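This page does not spell out how `get_topics` forms its clusters. Purely as an illustration of what an `n_clusters` parameter controls, and assuming that topics come from grouping document embedding vectors, here is a plain k-means written with the standard library. The function `kmeans` and the toy 2-D points are hypothetical and are not taken from Bunkatopics.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain k-means on equal-length float tuples; returns one cluster id per point.

    Assumes 1 <= k <= len(points).
    """
    # Deterministic start: evenly spaced points serve as the initial centroids.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Two well-separated groups of toy "embeddings".
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
toy_labels = kmeans(toy_points, k=2)
```

With real embeddings the vectors have hundreds of dimensions, but the idea is the same: a larger `n_clusters` yields more, finer-grained topics.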

Topic Modeling with GenAI Summarization of Topics

You can have Generative AI summarize the topics, using any LLM available through langchain. The example below calls the Mistral AI 7B model (`mistralai/Mistral-7B-v0.1`) through the langchain Hugging Face Hub integration.

import os
from langchain.llms import HuggingFaceHub

# Use Mistral AI to summarize the topics
llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-v0.1",
    huggingfacehub_api_token=os.environ.get("HF_TOKEN"),
)
df_topics = bunka.get_clean_topic_name(generative_model=llm)
print(df_topics)
bunka.visualize_topics(width=800, height=800)
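The snippet above reads the token with `os.environ.get("HF_TOKEN")`, which silently returns `None` when the variable is missing, so the failure only shows up later when the model is called. A small guard, sketched here with a hypothetical helper named `require_env`, makes the problem visible immediately.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

The result of `require_env("HF_TOKEN")` can then be passed as `huggingfacehub_api_token` in place of the plain `os.environ.get` call.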

Retrieval Augmented Generation (RAG)

Thanks to the langchain integration, Bunkatopics can perform Retrieval Augmented Generation (RAG) with different generative models.

query = 'What is the main fight of Donald Trump?'
# `llm` is the generative model created in the previous section
res = bunka.rag_query(query=query, generative_model=llm, top_doc=5)
print(res['result'])

OUTPUT:

The main fight of Donald Trump in the presidential elections of 2016 was against Hillary Clinton. He believed he was the best candidate for president and was able to beat many other candidates in the field due to his fame and political opinions.

for doc in res['source_documents']:
    text = doc.page_content.strip()
    print(text)

OUTPUT:

what do you say donald run for president
why only donald trump can beat hillary/n
via donald trump on who he likes for president donald trump/n
if the 2016 presidential field is so deep why is donaldtrump beating so many of their stars
donald trump is a respected businessman with insightful political opinions
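The source documents printed above are the passages retrieved for the query before the generative model writes its answer. How `rag_query` ranks documents internally is not described here, so the following is only a generic sketch of the retrieval half of RAG: score every document against the query and keep the best `top_doc`. The bag-of-words vectors are a toy substitute for a real embedding model, and `bow`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter


def bow(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, top_doc=5):
    """Return the top_doc documents most similar to the query, best first."""
    q = bow(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:top_doc]
```

In a full pipeline the retrieved texts would be placed in the prompt as context, which is why inspecting `source_documents` is a useful check on the generated answer.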

Bourdieu Map

The Bourdieu map displays the texts on a two-dimensional unsupervised scale. Every region of the map is a topic described by its most specific terms. Clusters are created, and their names are also summarized using Generative AI.

The power of this visualisation lies in constraining the axes by defining continuums and observing how the data are distributed along them. The inspiration comes from the French sociologist Bourdieu, who projected items onto two-dimensional maps.

import os
from langchain.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="mistralai/Mistral-7B-v0.1",
    huggingfacehub_api_token=os.environ.get("HF_TOKEN"),
)

# Labels displayed at each end of the two axes
manual_axis_name = {
    "x_left_name": "positive",
    "x_right_name": "negative",
    "y_top_name": "women",
    "y_bottom_name": "men",
}

bourdieu_fig = bunka.visualize_bourdieu(
    generative_model=llm,
    x_left_words=["this is a positive content"],
    x_right_words=["this is a negative content"],
    y_top_words=["this is about women"],
    y_bottom_words=["this is about men"],
    height=800,
    width=800,
    display_percent=True,
    clustering=True,
    topic_n_clusters=10,
    topic_terms=5,
    topic_top_terms_overall=500,
    topic_gen_name=True,
    convex_hull=True,
    radius_size=0.5,
    manual_axis_name=manual_axis_name,
)
bourdieu_fig.show()
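One way to picture a continuum is as two poles, each described by a phrase such as "this is a positive content" and "this is a negative content", with every document placed according to which pole it resembles more. The sketch below assumes that a document's coordinate is its similarity to the right pole minus its similarity to the left pole. That formula is an assumption made for illustration, not a description of how `visualize_bourdieu` computes positions, and the 2-D vectors stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def continuum_position(doc_vec, left_vec, right_vec):
    """Negative values lean toward the left pole, positive values toward the right."""
    return cosine(doc_vec, right_vec) - cosine(doc_vec, left_vec)


# Toy poles: the left pole points along the first axis, the right along the second.
left_pole, right_pole = (1.0, 0.0), (0.0, 1.0)
```

Applying the same function with a second pair of poles gives the vertical coordinate, so each document ends up at a point on a two-dimensional map.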

Streamlit

Run Streamlit to use BunkaTopics with a nice front-end.

python -m streamlit run streamlit/app.py

Multilanguage

The package uses spaCy to extract meaningful terms for the topic representation.

To switch to another language such as French, first download the corresponding spaCy model:

python -m spacy download fr_core_news_lg

embedding_model = HuggingFaceEmbeddings(model_name="distiluse-base-multilingual-cased-v2")

bunka = Bunka(embedding_model=embedding_model, language = 'fr_core_news_lg')

bunka.fit(full_docs)
bunka.get_topics(n_clusters = 20)

Functionality

Here are all the things you can do with Bunkatopics

Common

Below, you will find an overview of common functions in Bunkatopics.

Method	Code
Fit the model	`.fit(docs)`
Fit the model and get the topics	`.fit_transform(docs)`
Access the topics	`.get_topics(n_clusters=10)`
RAG	`.rag_query(query, generative_model)`
Summarize the topic names with a generative model	`.get_clean_topic_name(generative_model)`
Access the distribution of topics	`.get_topic_repartition()`
Visualize the topics on a map	`.visualize_topics()`
Visualize the topics on natural-language supervised axes	`.visualize_bourdieu()`
Access the coherence of topics	`.get_topic_coherence()`
Get the closest documents to your search	`.search('politics')`

Attributes

You can access several attributes

Attribute	Description
`.docs`	The documents, stored as a Document pydantic model.
`.topics`	The topics, stored as a Topic pydantic model.

Release history

Version	Release date
0.46.1	May 14, 2024
0.46	Apr 11, 2024
0.45	Jan 21, 2024
0.43 (this version)	Oct 20, 2023
0.42	Oct 20, 2023
0.41	Oct 7, 2023
0.39	Jun 22, 2023
0.38	Jun 13, 2023
0.37	Jun 13, 2023
0.36	Jun 7, 2023
0.35	Jun 7, 2023
0.34	Jan 10, 2023
0.33	Jan 2, 2023
0.32	Jan 2, 2023
0.31	Jan 2, 2023
0.30	Dec 19, 2022
0.29	Dec 17, 2022
0.28	Dec 15, 2022
0.27	Dec 15, 2022
0.26	Dec 5, 2022
0.25	Oct 23, 2022
0.24	Oct 21, 2022
0.23	Oct 20, 2022
0.22	Oct 20, 2022
0.21	Oct 20, 2022
0.20	Jul 22, 2022
0.19	Jun 28, 2022
0.18	Jun 28, 2022
0.17	Jun 28, 2022
0.16	Jun 27, 2022
0.15	Jun 25, 2022
0.14	May 27, 2022
0.13	May 27, 2022
0.12	May 25, 2022
0.11	May 24, 2022
0.8	May 24, 2022
0.7	May 24, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bunkatopics-0.43.tar.gz (85.4 kB)

Uploaded Oct 20, 2023 Source

Built Distribution

bunkatopics-0.43-py3-none-any.whl (53.6 kB)

Uploaded Oct 20, 2023 Python 3

Hashes for bunkatopics-0.43.tar.gz
Algorithm	Hash digest
SHA256	`235beb502c35ca973b7268cddcedb39183a9a0c1a1f9f8c2ec66b88e17343513`
MD5	`0d5add060fb9a8b6799ebbe6cc18c79a`
BLAKE2b-256	`c76ee0bd9aa58c212b6a2793bbe78f4cfd6485706dd83224d4da86a3cecd16ac`

Hashes for bunkatopics-0.43-py3-none-any.whl
Algorithm	Hash digest
SHA256	`038a555870d2d12b1325d8069c41e820ad565b85196a6b4489bcd0b6a356573c`
MD5	`ad489cb3695c69ed2ed1ea77e57285fb`
BLAKE2b-256	`ed21bb1a745429a0d22f8f30353221b35978a364e4946992751e9c7d38de56a6`