🐍 snakefusion

Introduction

snakefusion is a Python package for reading, writing, and using finalfusion, fastText, floret, GloVe, and word2vec embeddings. This package is a thin wrapper around the Rust finalfusion crate.

snakefusion supports the same types of embeddings as finalfusion:

Vocabulary:
- No subwords
- Subwords
Embedding matrix:
- Array
- Memory-mapped
- Quantized
Format:
- fastText
- finalfusion
- floret
- GloVe
- word2vec

Building from source

Building snakefusion from source requires a Rust toolchain installed through rustup, plus the setuptools-rust package:

$ pip install --upgrade setuptools-rust

You can then build and install snakefusion in your environment:

$ pip install .

Usage

Embeddings can be loaded as follows:

import snakefusion

# Loading embeddings in finalfusion format
embeds = snakefusion.Embeddings("myembeddings.fifu")

# Or if you want to memory-map the embedding matrix:
embeds = snakefusion.Embeddings("myembeddings.fifu", mmap=True)

# fastText format
embeds = snakefusion.Embeddings.read_fasttext("myembeddings.bin")

# floret format
embeds = snakefusion.Embeddings.read_floret_text("myembeddings.floret")

# word2vec format
embeds = snakefusion.Embeddings.read_word2vec("myembeddings.w2v")
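
When the format is only known at run time, the loading calls above can be wrapped in a small dispatcher. The following is a minimal sketch, not part of snakefusion: `loader_name` and `load_embeddings` are hypothetical helpers, and the mapping from file extension to format is an assumption taken from the example file names above.

```python
from pathlib import Path

# Assumed mapping from file extension to the loader used in the examples
# above. None means "use the Embeddings constructor" (finalfusion format).
_LOADERS = {
    ".fifu": None,
    ".bin": "read_fasttext",
    ".floret": "read_floret_text",
    ".w2v": "read_word2vec",
}


def loader_name(path):
    """Return the name of the loader for `path`, or None for finalfusion."""
    suffix = Path(path).suffix.lower()
    if suffix not in _LOADERS:
        raise ValueError(f"unrecognized embedding file extension: {suffix!r}")
    return _LOADERS[suffix]


def load_embeddings(path, mmap=False):
    """Load embeddings with the loader matching the file extension."""
    # Imported lazily so that loader_name() works without snakefusion.
    import snakefusion

    name = loader_name(path)
    if name is None:
        return snakefusion.Embeddings(str(path), mmap=mmap)
    if mmap:
        # Memory mapping is only shown for the finalfusion format above,
        # so this sketch conservatively refuses it for other formats.
        raise ValueError("mmap is only supported here for .fifu files")
    return getattr(snakefusion.Embeddings, name)(str(path))
```

With this in place, `load_embeddings("myembeddings.fifu", mmap=True)` would be equivalent to the memory-mapped call shown earlier.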

You can then look up an embedding and run similarity or analogy queries:

e = embeds.embedding("Tübingen")

# default similarity query for "Tübingen"
embeds.word_similarity("Tübingen")

# similarity query based on a vector, returning the closest embedding to
# the input vector, skipping "Tübingen"
embeds.embeddings_similarity(e, skip={"Tübingen"})

# default analogy query
embeds.analogy("Berlin", "Deutschland", "Amsterdam")

# analogy query allowing "Deutschland" as answer
embeds.analogy("Berlin", "Deutschland", "Amsterdam", mask=(True, False, True))
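
To make the queries above more concrete, here is a conceptual NumPy sketch of what similarity and analogy queries commonly compute. It is not snakefusion's implementation: the toy vocabulary, the vectors, and the helpers `nearest` and `toy_analogy` are made up for illustration, and the use of cosine similarity with the query vector `b - a + c` is an assumption based on the usual formulation of these queries.

```python
import numpy as np

# Toy vocabulary with hand-made vectors; the dimensions loosely mean
# (capital, country, German, Dutch, French).
words = ["Berlin", "Deutschland", "Amsterdam", "Niederlande", "Paris"]
matrix = np.array([
    [1.0, 0.0, 1.0, 0.0, 0.0],  # Berlin
    [0.0, 1.0, 1.0, 0.0, 0.0],  # Deutschland
    [1.0, 0.0, 0.0, 1.0, 0.0],  # Amsterdam
    [0.0, 1.0, 0.0, 1.0, 0.0],  # Niederlande
    [1.0, 0.0, 0.0, 0.0, 1.0],  # Paris
])


def nearest(query, skip=(), limit=1):
    """Rank words by cosine similarity to `query`, leaving out `skip`."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ (query / np.linalg.norm(query))
    order = np.argsort(-sims)
    ranked = [(words[i], float(sims[i])) for i in order if words[i] not in skip]
    return ranked[:limit]


def toy_analogy(a, b, c, mask=(True, True, True), limit=1):
    """Answer 'a is to b as c is to ?' using the vector b - a + c."""
    vec = {w: matrix[i] for i, w in enumerate(words)}
    query = vec[b] - vec[a] + vec[c]
    # A True mask entry excludes the corresponding query word from results.
    skip = {w for w, masked in zip((a, b, c), mask) if masked}
    return nearest(query, skip=skip, limit=limit)


best = toy_analogy("Berlin", "Deutschland", "Amsterdam")
unmasked = toy_analogy("Berlin", "Deutschland", "Amsterdam",
                       mask=(True, False, True), limit=2)
```

In this toy setup, setting the second mask entry to `False` lets "Deutschland" appear among the ranked answers, mirroring the masked analogy query above.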

If you want to operate directly on the full embedding matrix, you can get a copy of this matrix through:

# get a copy of the embedding matrix; changes to the copy do not affect the original matrix
embeds.matrix_copy()
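
As an illustration of working on such a copy, the sketch below L2-normalizes every row in place and confirms that the source array is untouched. It assumes the copy is a 2-D NumPy array with one row per embedding; a small stand-in array is used here so the example runs without an embeddings file.

```python
import numpy as np

# Stand-in for the matrix held by the library (one row per embedding).
original = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]])

# Stand-in for the matrix copy: an independent array with the same values.
matrix = original.copy()

# L2-normalize each row of the copy in place.
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

# The copy now has unit-length rows, while `original` is unchanged.
row_norms = np.linalg.norm(matrix, axis=1)
```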

Finally, access to the vocabulary is provided through:

v = embeds.vocab()
# get a list of indices associated with "Tübingen"
v.item_to_indices("Tübingen")

# get a list of `(ngram, index)` tuples for "Tübingen"
v.ngram_indices("Tübingen")

# get a list of subword indices for "Tübingen"
v.subword_indices("Tübingen")
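
Subword vocabularies typically represent a word through its character n-grams, each mapped to a row of the embedding matrix. The sketch below illustrates the general idea only. `char_ngrams` and `toy_ngram_indices` are hypothetical helpers, the `<` and `>` boundary markers and the n-gram lengths 3 to 6 are assumptions following common fastText-style defaults, and CRC32 is a stand-in hash rather than the one finalfusion uses, so real indices from `ngram_indices` will differ.

```python
import zlib


def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the bracketed word, shortest first."""
    bracketed = f"<{word}>"
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def toy_ngram_indices(word, n_words=1000, buckets=2**10):
    """Pair each n-gram with a bucket index placed after the word rows.

    n_words and buckets are made-up sizes; CRC32 is a stand-in hash.
    """
    return [
        (ngram, n_words + zlib.crc32(ngram.encode("utf-8")) % buckets)
        for ngram in char_ngrams(word)
    ]


pairs = toy_ngram_indices("Tübingen")
```

Each entry of `pairs` is an `(ngram, index)` tuple, the same shape as described for `ngram_indices` above.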

More usage examples can be found in the examples directory.

Release history

- 0.1.6: Dec 17, 2021
- 0.1.5: Dec 12, 2021
- 0.1.4: Dec 12, 2021
- 0.1.3: Dec 5, 2021
- 0.1.2: Dec 2, 2021
- 0.1.0: Dec 1, 2021 (this version)

Download files

Source Distribution

snakefusion-0.1.0.tar.gz (19.8 kB), uploaded Dec 1, 2021

Hashes for snakefusion-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b361504fbf932885456dbdb18117bf2761231cce1af7fc8ee95a0c1a6a4a0134`
MD5	`50dc8b00e20477aaae9a3c3d9ee052c3`
BLAKE2b-256	`5a5901c17283875ff5eeb86f2f2d8cf2e1219d047dbf5dcc1159ff95b2d2660f`