# Vexpresso
Vexpresso is a simple and scalable multi-modal vector database built with Daft.
## Features
- 🍵 **Simple**: Vexpresso is lightweight and very easy to get started with!
- 🔌 **Flexible**: Unlike many other vector databases, Vexpresso supports arbitrary datatypes, so you can query multi-modal objects (images, audio, video, etc.).
- 🌐 **Scalable**: Because Vexpresso uses Daft, it can be scaled with Ray to multi-GPU / multi-CPU clusters.
- 📚 **Persistent**: Vexpresso has easily accessible functions for saving to and loading from Hugging Face datasets.
## Installation
To install from PyPI:

```shell
pip install vexpresso
```
To install from source:

```shell
git clone git@github.com:shyamsn97/vexpresso.git
cd vexpresso
pip install -e .
```
## Usage
🔥 Check out our Showcase notebook for a more detailed walkthrough!
In this example, we create a simple collection and embed its documents using Hugging Face sentence transformers.
```python
from typing import Any, List

import vexpresso
# import embedding functions from vexpresso
import vexpresso.embedding_functions as ef

# creating a collection object!
collection = vexpresso.create(
    data={
        "documents": [
            "This is document1",
            "This is document2",
            "This is document3",
            "This is document4",
            "This is document5",
            "This is document6",
        ],
        "source": ["notion", "google-docs", "google-docs", "notion", "google-docs", "google-docs"],
        "num_lines": [10, 20, 30, 40, 50, 60],
    },
    # backend="ray",  # turn this flag on to start / connect to a Ray cluster!
)
```
```python
# create a simple embedding function from sentence_transformers
def hf_embed_fn(content: List[Any]):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(content, convert_to_tensor=True).detach().cpu().numpy()

# or use a langchain embedding function
def langchain_embed_fn(content: List[Any]):
    from langchain.embeddings import OpenAIEmbeddings

    embeddings_model = OpenAIEmbeddings()
    return embeddings_model.embed_documents(content)
```
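An embedding function just needs to map a list of items to a 2D array with one vector per item; any callable with that shape works. Here is a minimal, dependency-light sketch of that contract (the hashing scheme and the `toy_embed_fn` name are illustrative only, not a real embedding model):

```python
import hashlib
from typing import Any, List

import numpy as np

def toy_embed_fn(content: List[Any], dim: int = 8) -> np.ndarray:
    """Illustrative stand-in for a model: hash each item into a fixed-size vector."""
    vectors = []
    for item in content:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        # map the first `dim` bytes into floats in [0, 1]
        vectors.append([b / 255.0 for b in digest[:dim]])
    return np.asarray(vectors, dtype=np.float32)

embeddings = toy_embed_fn(["This is document1", "This is document2"])
print(embeddings.shape)  # (2, 8): one row per input item
```

Because the hash is deterministic, the same input always maps to the same vector, which makes a toy function like this handy for testing a pipeline without downloading a model.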
```python
# embed creates a column of embeddings in the collection -- there can be more than one embedding column!
# execution is lazy until .execute() is called
collection = collection.embed(
    "documents",
    embedding_fn=hf_embed_fn,
    to="document_embeddings",
    # lazy=False,  # if this is False, .execute() doesn't need to be called
).execute()
```
```python
# create a queried collection with the subset of content closest to the query
queried_collection = collection.query(
    "document_embeddings",
    query="query document6",
    k=4,  # return the 4 closest entries
    lazy=False,
    # query_embedding=[query1, query2, ...]
    # filter_conditions={"metadata_field": {"operator, ex: 'eq'": "value"}}  # optional metadata filter
)

# batch query -- returns a list of collections
# batch_queried_collection = collection.batch_query(
#     "document_embeddings",
#     queries=["doc1", "doc2"],
#     k=2,
# )
```
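Conceptually, a vector query embeds the query text with the same embedding function used for the column and ranks the stored vectors by similarity, keeping the top `k`. A plain-NumPy sketch of that idea (`top_k_cosine` is a hypothetical helper using cosine similarity, not Vexpresso's actual implementation, which may use a different metric):

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, stored: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k stored vectors most similar to query_vec."""
    stored_norm = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = stored_norm @ query_norm  # cosine similarity per stored row
    return np.argsort(-scores)[:k]    # highest similarity first

stored = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_k_cosine(np.array([1.0, 0.1]), stored, k=2))  # [0 2]
```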
```python
# filter the collection for documents with num_lines less than or equal to 30
filtered_collection = queried_collection.filter(
    {
        "num_lines": {"lte": 30},
    }
).execute()

# show the underlying dataframe
filtered_collection.show()

# convert to a dictionary
filtered_dict = filtered_collection.to_dict()
documents = filtered_dict["documents"]
```
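Filter conditions are plain dicts mapping a column name to an `{operator: value}` pair (`"lte"` above, `"eq"` in the query example). As a mental model, each condition becomes a row-level predicate over the columnar data. The `filter_columns` helper below is a simplified illustration of that, and Vexpresso's real operator set may differ:

```python
import operator

# assumed operator names, mirroring the "eq" / "lte" conditions shown above
OPS = {
    "eq": operator.eq,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}

def filter_columns(data: dict, conditions: dict) -> dict:
    """Keep rows (across all columns) that satisfy every condition."""
    num_rows = len(next(iter(data.values())))
    keep = [
        i
        for i in range(num_rows)
        if all(
            OPS[op](data[col][i], value)
            for col, cond in conditions.items()
            for op, value in cond.items()
        )
    ]
    return {col: [values[i] for i in keep] for col, values in data.items()}

data = {"documents": ["doc1", "doc2", "doc3"], "num_lines": [10, 20, 30]}
print(filter_columns(data, {"num_lines": {"lte": 20}}))
# {'documents': ['doc1', 'doc2'], 'num_lines': [10, 20]}
```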
```python
# add new entries!
collection = collection.add(
    [
        {"documents": "new documents 1", "source": "notion", "num_lines": 2},
        {"documents": "new documents 2", "source": "google-docs", "num_lines": 40},
    ]
)
collection = collection.execute()
```
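`add` takes a list of row dicts and appends them across the collection's columns. A simplified dict-of-lists sketch of that append (the `add_rows` helper is illustrative, not Vexpresso's internals):

```python
def add_rows(data: dict, rows: list) -> dict:
    """Append row dicts to a dict-of-lists columnar table, returning a new table."""
    out = {col: list(values) for col, values in data.items()}  # copy each column
    for row in rows:
        for col in out:
            out[col].append(row[col])
    return out

data = {"documents": ["doc1"], "num_lines": [10]}
print(add_rows(data, [{"documents": "doc2", "num_lines": 2}]))
# {'documents': ['doc1', 'doc2'], 'num_lines': [10, 2]}
```

Returning a new table rather than mutating the input mirrors the reassignment style in the snippet above (`collection = collection.add(...)`).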
## Resources
## Contributing

Feel free to make a pull request or open an issue with a feature request!