Read and use various word embedding formats

🐍 snakefusion

Introduction

snakefusion is a Python package for reading, writing, and using finalfusion, fastText, floret, GloVe, and word2vec embeddings. This package is a thin wrapper around the Rust finalfusion crate.

snakefusion supports the same types of embeddings as finalfusion:

  • Vocabulary:
    • No subwords
    • Subwords
  • Embedding matrix:
    • Array
    • Memory-mapped
    • Quantized
  • Format:
    • fastText
    • finalfusion
    • floret
    • GloVe
    • word2vec

Building from source

Building snakefusion from source requires a Rust toolchain installed through rustup, as well as the setuptools-rust package:

$ pip install --upgrade setuptools-rust

You can then build and install snakefusion in your environment:

$ pip install .

Usage

Embeddings can be loaded as follows:

import snakefusion

# Loading embeddings in finalfusion format
embeds = snakefusion.Embeddings("myembeddings.fifu")

# Or if you want to memory-map the embedding matrix:
embeds = snakefusion.Embeddings("myembeddings.fifu", mmap=True)

# fastText format
embeds = snakefusion.Embeddings.read_fasttext("myembeddings.bin")

# floret format
embeds = snakefusion.Embeddings.read_floret_text("myembeddings.floret")

# word2vec format
embeds = snakefusion.Embeddings.read_word2vec("myembeddings.w2v")
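
When files in several formats are handled in one script, the reader calls above can be wrapped in a small dispatcher. The following is only a sketch: `reader_name` and `load_embeddings` are hypothetical helpers, not part of snakefusion, and the extension table merely mirrors the file names used in the examples above (snakefusion does not infer the format from the extension).

```python
from pathlib import Path

# Assumption: these extensions only mirror the example file names above.
READERS = {
    ".bin": "read_fasttext",
    ".floret": "read_floret_text",
    ".w2v": "read_word2vec",
}


def reader_name(path):
    """Return the reader method name for `path`, or None for finalfusion files."""
    suffix = Path(path).suffix.lower()
    if suffix == ".fifu":
        return None
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unrecognised embedding file extension: {suffix!r}")


def load_embeddings(path, mmap=False):
    """Hypothetical helper: load `path` with the reader matching its extension."""
    import snakefusion  # imported lazily so reader_name() works without it

    name = reader_name(path)
    if name is None:
        # finalfusion format: the only case where mmap is shown above
        return snakefusion.Embeddings(path, mmap=mmap)
    if mmap:
        raise ValueError("this sketch only memory-maps finalfusion files")
    return getattr(snakefusion.Embeddings, name)(path)
```

With this in place, `load_embeddings("myembeddings.fifu", mmap=True)` would behave like the second example above.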

You can then compute an embedding or perform similarity and analogy queries:

e = embeds.embedding("Tübingen")

# default similarity query for "Tübingen"
embeds.word_similarity("Tübingen")

# similarity query based on a vector, returning the closest embedding to
# the input vector, skipping "Tübingen"
embeds.embeddings_similarity(e, skip={"Tübingen"})

# default analogy query
embeds.analogy("Berlin", "Deutschland", "Amsterdam")

# analogy query allowing "Deutschland" as answer
embeds.analogy("Berlin", "Deutschland", "Amsterdam", mask=(True, False, True))
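
To make the analogy query and its mask more concrete, here is a conceptual sketch in plain Python. It is not snakefusion's implementation: the toy vectors are invented, and it assumes (as the comment above suggests) that a `True` entry in `mask` excludes the corresponding query word from the results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(vectors, a, b, c, mask=(True, True, True)):
    """Rank answers to "a is to b as c is to ?" by cosine similarity."""
    # Offset arithmetic: b - a + c
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Query words whose mask entry is True may not appear in the results.
    excluded = {word for word, masked in zip((a, b, c), mask) if masked}
    scored = [
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scored, reverse=True)


# Invented toy vectors: [German, Dutch, is-a-country]
TOY = {
    "Berlin": [1.0, 0.0, 0.0],
    "Deutschland": [1.0, 0.0, 1.0],
    "Amsterdam": [0.0, 1.0, 0.0],
    "Niederlande": [0.0, 1.0, 1.0],
}

default_result = analogy(TOY, "Berlin", "Deutschland", "Amsterdam")
unmasked_result = analogy(
    TOY, "Berlin", "Deutschland", "Amsterdam", mask=(True, False, True)
)
```

In this toy setting the best answer is "Niederlande" in both cases; with the second mask, "Deutschland" additionally shows up as a lower-ranked candidate.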

If you want to operate directly on the full embedding matrix, you can get a copy of this matrix through:

# get a copy of the embedding matrix; changes to the copy won't touch the original matrix
embeds.matrix_copy()

Finally, access to the vocabulary is provided through:

v = embeds.vocab()
# get a list of indices associated with "Tübingen"
v.item_to_indices("Tübingen")

# get a list of `(ngram, index)` tuples for "Tübingen"
v.ngram_indices("Tübingen")

# get a list of subword indices for "Tübingen"
v.subword_indices("Tübingen")
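
For intuition about what the n-grams returned by `ngram_indices` look like, the sketch below enumerates fastText-style character n-grams in plain Python. It is an illustration only: `bracketed_ngrams` is a hypothetical helper, the lengths 3 to 6 are the fastText defaults rather than anything read from your embeddings, and the mapping from n-grams to indices (which depends on the vocabulary type) is not reproduced.

```python
def bracketed_ngrams(word, min_n=3, max_n=6):
    """List character n-grams of `word` after adding "<" and ">" markers."""
    marked = f"<{word}>"
    return [
        marked[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(marked) - n + 1)
    ]


ngrams = bracketed_ngrams("Tübingen")
```

The boundary markers let a model distinguish prefixes and suffixes (for example "<Tü" and "en>") from the same characters occurring inside a word.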

More usage examples can be found in the examples directory.

Where to go from here

Download files

Source Distribution

snakefusion-0.1.0.tar.gz (19.8 kB)
