BunkaTopics
BunkaTopics is a topic modeling package that uses embeddings and focuses on topic representation to extract meaningful, interpretable topics from a list of documents.
Installation
Before installing bunkatopics, please install the following packages:
Load the spaCy language models:
python -m spacy download fr_core_news_lg
python -m spacy download en_core_web_sm
Finally, install bunkatopics using pip:
pip install bunkatopics
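Because spaCy models are installed as regular Python packages, a quick way to confirm that the downloads above succeeded is to check whether those packages can be found. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of bunkatopics or spaCy.

```python
import importlib.util

# Model package names taken from the download commands above.
SPACY_MODELS = ["fr_core_news_lg", "en_core_web_sm"]


def missing_packages(names):
    """Return the names that cannot be located as importable top-level packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(SPACY_MODELS)
    if missing:
        # Each missing model can be fetched with: python -m spacy download <name>
        print("Missing spaCy models:", ", ".join(missing))
    else:
        print("All spaCy models are installed.")
```

The check only verifies that the packages are discoverable; it does not load the models or validate their versions.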
Quick Start with BunkaTopics
from bunkatopics import BunkaTopics
import pandas as pd
data = pd.read_csv('data/imdb.csv', index_col = [0])
data = data.sample(2000, random_state = 42)
# Instantiate the model, extract the terms, and embed the documents
model = BunkaTopics(data, # DataFrame
    text_var = 'description', # Text column
    index_var = 'imdb', # Index column (mandatory)
    extract_terms=True, # Extract terms?
    terms_embeddings=True, # Extract term embeddings?
    docs_embeddings=True, # Extract document embeddings?
    embeddings_model="distiluse-base-multilingual-cased-v1", # Choose an embeddings model
    multiprocessing=True, # Multiprocessing of embeddings
    language="en", # Choose between English "en" and French "fr"
    sample_size_terms = len(data),
    terms_limit=10000, # Top terms to output
    terms_ents=True, # Extract entities
    terms_ngrams=(1, 2), # Choose n-grams to extract
    terms_ncs=True, # Extract noun chunks
    terms_include_pos=["NOUN", "PROPN", "ADJ"], # Include parts of speech
    terms_include_types=["PERSON", "ORG"]) # Include entity types
# Extract the topics
topics = model.get_clusters(topic_number= 15, # Number of topics
    top_terms_included = 1000, # Compute the specific terms from the top n terms
    top_terms = 5, # Most specific terms to describe the topics
    term_type = "lemma", # Use "lemma" or "text"
    ngrams = [1, 2], # N-grams for topic representation
    clusterer = 'hdbscan') # Choose between KMeans and HDBSCAN
# Visualize the clusters. It is advised to use no more than 5 terms (top_terms = 5) to avoid overcrowding the figure
fig = model.visualize_clusters(search = None,
    width=1000,
    height=1000,
    fit_clusters=True, # Fit UMAP so that clusters are visually well separated
    density_plot=False) # Plot a density map to get a territory overview
fig.show()
centroid_documents = model.get_centroid_documents(top_elements=2)
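The README does not say how `get_centroid_documents` selects its documents. One plausible reading, stated here as an assumption, is that it returns the documents whose embeddings lie closest to the center of their cluster. The sketch below illustrates that idea for a single cluster in plain Python with cosine similarity; the function names and the toy vectors are hypothetical and are not taken from bunkatopics.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def closest_to_centroid(doc_ids, vectors, top_elements=2):
    """Return the ids of the documents most similar to the cluster centroid."""
    center = centroid(vectors)
    ranked = sorted(
        zip(doc_ids, vectors),
        key=lambda pair: cosine(pair[1], center),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_elements]]


# Toy 2-dimensional "embeddings" for one cluster (illustrative values only).
ids = ["a", "b", "c"]
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
representatives = closest_to_centroid(ids, vecs, top_elements=2)
```

Reading the few documents nearest a cluster's center is a common way to sanity-check whether the extracted terms describe the topic well.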