# Vexpresso
Vexpresso is a simple and scalable multi-modal vector database built with Daft.
## Features
- 🍵 **Simple**: Vexpresso is lightweight and very easy to get started with!
- 🔌 **Flexible**: Unlike many other vector databases, Vexpresso supports arbitrary datatypes, so you can query multi-modal objects (images, audio, video, etc.).
- 🌐 **Scalable**: Because Vexpresso uses Daft, it can be scaled with Ray to multi-GPU / multi-CPU clusters.
- 📚 **Persistent**: Vexpresso has easily accessible functions for saving to and loading from Hugging Face datasets.
## Installation
To install from PyPI:

```shell
pip install vexpresso
```
To install from source:

```shell
git clone git@github.com:shyamsn97/vexpresso.git
cd vexpresso
pip install -e .
```
## Usage
🔥 Check out our Showcase notebook for a more detailed walkthrough!
In this example, we create a simple collection and embed its documents using Hugging Face sentence transformers.
```python
from typing import Any, List

import vexpresso
# import embedding functions from vexpresso
import vexpresso.embedding_functions as ef

# creating a collection object!
collection = vexpresso.create(
    data={
        "documents": [
            "This is document1",
            "This is document2",
            "This is document3",
            "This is document4",
            "This is document5",
            "This is document6",
        ],
        "source": ["notion", "google-docs", "google-docs", "notion", "google-docs", "google-docs"],
        "num_lines": [10, 20, 30, 40, 50, 60],
    },
    # backend="ray",  # turn this flag on to start / connect to a Ray cluster!
)
```
```python
# create a simple embedding function from sentence_transformers
def hf_embed_fn(content: List[Any]):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(content, convert_to_tensor=True).detach().cpu().numpy()

# or use a langchain embedding function
def langchain_embed_fn(content: List[Any]):
    from langchain.embeddings import OpenAIEmbeddings

    embeddings_model = OpenAIEmbeddings()
    return embeddings_model.embed_documents(content)
```
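An embedding function just needs to map a list of items to a 2D array with one vector per item; any callable with that shape works. Here is a minimal, dependency-light sketch of that contract (the hashing scheme and the `toy_embed_fn` name are illustrative only, not a real embedding model):

```python
import hashlib
from typing import Any, List

import numpy as np

def toy_embed_fn(content: List[Any], dim: int = 8) -> np.ndarray:
    """Illustrative stand-in for a model: hash each item into a fixed-size vector."""
    vectors = []
    for item in content:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        # map the first `dim` bytes into floats in [0, 1]
        vectors.append([b / 255.0 for b in digest[:dim]])
    return np.asarray(vectors, dtype=np.float32)

embeddings = toy_embed_fn(["This is document1", "This is document2"])
print(embeddings.shape)  # (2, 8): one row per input item
```

Because the hash is deterministic, the same input always maps to the same vector, which makes a toy function like this handy for testing a pipeline without downloading a model.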
```python
# embed creates a column of embeddings in the collection -- there can be more than one embedding column!
# execution is lazy until .execute() is called
collection = collection.embed(
    "documents",
    embedding_fn=hf_embed_fn,
    to="document_embeddings",
    # lazy=False,  # if this is False, .execute() doesn't need to be called
).execute()
```
```python
# create a queried collection with the subset of content closest to the query
queried_collection = collection.query(
    "document_embeddings",
    query="query document6",
    k=4,  # return the 4 closest entries
    lazy=False,
    # query_embedding=[query1, query2, ...]
    # filter_conditions={"metadata_field": {"operator, ex: 'eq'": "value"}}  # optional metadata filter
)

# batch query -- returns a list of collections
# batch_queried_collection = collection.batch_query(
#     "document_embeddings",
#     queries=["doc1", "doc2"],
#     k=2,
# )
```
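Conceptually, a vector query embeds the query text with the same embedding function used for the column and ranks the stored vectors by similarity, keeping the top `k`. A plain-NumPy sketch of that idea (`top_k_cosine` is a hypothetical helper using cosine similarity, not Vexpresso's actual implementation, which may use a different metric):

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, stored: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k stored vectors most similar to query_vec."""
    stored_norm = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = stored_norm @ query_norm  # cosine similarity per stored row
    return np.argsort(-scores)[:k]    # highest similarity first

stored = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_k_cosine(np.array([1.0, 0.1]), stored, k=2))  # [0 2]
```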
```python
# filter the collection for documents with num_lines less than or equal to 30
filtered_collection = queried_collection.filter(
    {
        "num_lines": {"lte": 30},
    }
).execute()

# show the underlying dataframe
filtered_collection.show()

# convert to a dictionary
filtered_dict = filtered_collection.to_dict()
documents = filtered_dict["documents"]
```
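Filter conditions are plain dicts mapping a column name to an `{operator: value}` pair (`"lte"` above, `"eq"` in the query example). As a mental model, each condition becomes a row-level predicate over the columnar data. The `filter_columns` helper below is a simplified illustration of that, and Vexpresso's real operator set may differ:

```python
import operator

# assumed operator names, mirroring the "eq" / "lte" conditions shown above
OPS = {
    "eq": operator.eq,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}

def filter_columns(data: dict, conditions: dict) -> dict:
    """Keep rows (across all columns) that satisfy every condition."""
    num_rows = len(next(iter(data.values())))
    keep = [
        i
        for i in range(num_rows)
        if all(
            OPS[op](data[col][i], value)
            for col, cond in conditions.items()
            for op, value in cond.items()
        )
    ]
    return {col: [values[i] for i in keep] for col, values in data.items()}

data = {"documents": ["doc1", "doc2", "doc3"], "num_lines": [10, 20, 30]}
print(filter_columns(data, {"num_lines": {"lte": 20}}))
# {'documents': ['doc1', 'doc2'], 'num_lines': [10, 20]}
```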
```python
# add new entries!
collection = collection.add(
    [
        {"documents": "new documents 1", "source": "notion", "num_lines": 2},
        {"documents": "new documents 2", "source": "google-docs", "num_lines": 40},
    ]
)
collection = collection.execute()
```
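`add` takes a list of row dicts and appends them across the collection's columns. A simplified dict-of-lists sketch of that append (the `add_rows` helper is illustrative, not Vexpresso's internals):

```python
def add_rows(data: dict, rows: list) -> dict:
    """Append row dicts to a dict-of-lists columnar table, returning a new table."""
    out = {col: list(values) for col, values in data.items()}  # copy each column
    for row in rows:
        for col in out:
            out[col].append(row[col])
    return out

data = {"documents": ["doc1"], "num_lines": [10]}
print(add_rows(data, [{"documents": "doc2", "num_lines": 2}]))
# {'documents': ['doc1', 'doc2'], 'num_lines': [10, 2]}
```

Returning a new table rather than mutating the input mirrors the reassignment style in the snippet above (`collection = collection.add(...)`).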
## Resources
## Contributing

Feel free to make a pull request or open an issue with a feature request!