Project description

BunkaTopics

BunkaTopics is a Topic Modeling package that leverages Embeddings and focuses on Topic Representation to extract meaningful and interpretable topics from a list of documents.

Installation

Before installing bunkatopics, load the spaCy language models (spaCy itself must already be installed for these commands to work):

python -m spacy download fr_core_news_lg

python -m spacy download en_core_web_sm

Finally, install bunkatopics using pip:

pip install bunkatopics
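
As an optional sanity check after the steps above, the following minimal sketch reports which of the two language models are still missing. It relies on the fact that `spacy download` installs each model as a regular Python package; the helper name `missing_spacy_models` is hypothetical and not part of bunkatopics or spaCy.

```python
import importlib.util

# Model names taken from the download commands above.
REQUIRED_MODELS = ["fr_core_news_lg", "en_core_web_sm"]


def missing_spacy_models(model_names):
    """Return the model names that are not importable as installed packages.

    Hypothetical helper for illustration; not part of bunkatopics or spaCy.
    """
    return [
        name for name in model_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_spacy_models(REQUIRED_MODELS)
    if missing:
        for name in missing:
            print(f"Missing model, run: python -m spacy download {name}")
    else:
        print("All required spaCy models are installed.")
```

The check only confirms that the packages can be found; it does not load the models or verify their versions.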

Quick Start with BunkaTopics

from bunkatopics import BunkaTopics
import pandas as pd

data = pd.read_csv('data/imdb.csv', index_col = [0])
data = data.sample(2000, random_state = 42)

# Instantiate the model, extract the terms and embed the documents

model = BunkaTopics(data,  # DataFrame
                    text_var='description',  # Text column
                    index_var='imdb',  # Index column (mandatory)
                    extract_terms=True,  # Extract terms?
                    terms_embeddings=True,  # Extract term embeddings?
                    docs_embeddings=True,  # Extract document embeddings?
                    embeddings_model="distiluse-base-multilingual-cased-v1",  # Choose an embeddings model
                    multiprocessing=True,  # Multiprocessing of embeddings
                    language="en",  # Choose between English "en" and French "fr"
                    sample_size_terms=len(data),
                    terms_limit=10000,  # Top terms to output
                    terms_ents=True,  # Extract entities
                    terms_ngrams=(1, 2),  # Choose n-grams to extract
                    terms_ncs=True,  # Extract noun chunks
                    terms_include_pos=["NOUN", "PROPN", "ADJ"],  # Parts of speech to include
                    terms_include_types=["PERSON", "ORG"])  # Entity types to include

# Extract the topics

topics = model.get_clusters(topic_number=15,  # Number of topics
                            top_terms_included=1000,  # Compute the specific terms from the top n terms
                            top_terms=5,  # Most specific terms to describe the topics
                            term_type="lemma",  # Use "lemma" or "text"
                            ngrams=[1, 2],  # N-grams for topic representation
                            clusterer='hdbscan')  # Choose between Kmeans and HDBSCAN

# Visualize the clusters. It is advised to use no more than 5 terms (e.g. top_terms = 5) to avoid overloading the figure

fig = model.visualize_clusters(search=None,
                               width=1000,
                               height=1000,
                               fit_clusters=True,  # Fit UMAP so that clusters are visually well separated
                               density_plot=False)  # Plot a density map to get a territory overview

fig.show()

# Get the centroid documents of the topics

centroid_documents = model.get_centroid_documents(top_elements=2)
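
The source does not describe how `get_centroid_documents` chooses its documents. One plausible reading is that, for each topic, it returns the documents whose embeddings lie closest to the topic's centroid. The sketch below illustrates that general idea in plain Python on toy 2-D vectors. It is not the bunkatopics implementation: the function names, the Euclidean distance, and the sample data are all assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def closest_to_centroid(embeddings, labels, top_elements=2):
    """For each cluster label, return the ids of the documents whose
    embeddings are nearest (Euclidean distance) to the cluster centroid.

    Hypothetical illustration; not the bunkatopics implementation.
    """
    # Group document ids by their cluster label.
    clusters = {}
    for doc_id, label in labels.items():
        clusters.setdefault(label, []).append(doc_id)

    result = {}
    for label, doc_ids in clusters.items():
        center = centroid([embeddings[d] for d in doc_ids])
        # Rank documents by distance to the centroid, nearest first.
        ranked = sorted(doc_ids, key=lambda d: math.dist(embeddings[d], center))
        result[label] = ranked[:top_elements]
    return result


# Toy 2-D "embeddings" and cluster assignments (made-up data).
embeddings = {
    "doc_a": [0.0, 0.0],
    "doc_b": [1.0, 0.0],
    "doc_c": [5.0, 0.0],
    "doc_d": [10.0, 10.0],
    "doc_e": [10.0, 12.0],
}
labels = {"doc_a": 0, "doc_b": 0, "doc_c": 0, "doc_d": 1, "doc_e": 1}

print(closest_to_centroid(embeddings, labels, top_elements=2))
# → {0: ['doc_b', 'doc_a'], 1: ['doc_d', 'doc_e']}
```

In cluster 0 the centroid is (2, 0), so `doc_b` (distance 1) ranks ahead of `doc_a` (distance 2) and `doc_c` (distance 3) is dropped. Documents near the centroid tend to be the most typical of their cluster, which is why they are useful for reading what a topic is about.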

Download files

Source Distribution

bunkatopics-0.30.tar.gz (14.2 kB)

Uploaded Dec 19, 2022 Source

Built Distribution

bunkatopics-0.30-py3-none-any.whl (18.5 kB)

Uploaded Dec 19, 2022 Python 3

Hashes for bunkatopics-0.30.tar.gz
Algorithm	Hash digest
SHA256	`eae85b429f1f7673d0002c2e5da0cc670cd8ccb514e9ea044f19416bd67bc610`
MD5	`c6617866ed513c901d927842618f9f6f`
BLAKE2b-256	`4d0cc75b96c60854ab088d05fe1c5f19af9825d18f90a8649e3c7a5b363f4b9c`

Hashes for bunkatopics-0.30-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cbc2447e934b47889bc64883eb9e95ee06e43c0a06b561d6cdd35da4a3155539`
MD5	`1be3ca9d22a733d6caa1ed89d5373268`
BLAKE2b-256	`5b632450c0df54c9e129b4a878ef182a34a0bbe8d34aedd95a9aaef13dcbdec1`