Scikit-Bloom

An excuse to play with Rust, but also a neat trick for sklearn!

This package contains some bloom tricks for text pipelines in scikit-learn. To learn more about this trick, check out this blogpost.

You can install it via:

python -m pip install scikit-bloom

And you can import the components via:

from skbloom import BloomVectorizer, BloomishVectorizer, SlowBloomVectorizer

BloomVectorizer().fit(X).transform(X)
BloomishVectorizer().fit(X).transform(X)

The BloomVectorizer uses Rust under the hood for the hashing that constructs the bloom representation. The BloomishVectorizer just runs the HashingVectorizer from scikit-learn multiple times in sequence. The SlowBloomVectorizer is pretty much the same as the BloomVectorizer in terms of features, but is implemented in Python.
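To make the idea concrete, here is a rough pure-Python sketch of a bloom-style encoding: every token is hashed several times and each hash switches on one column. This is not the package's actual implementation. The helper names `bloom_indices` and `bloom_encode`, the seeded md5 hashing, and the whitespace tokenisation are all assumptions made for illustration; only the `n_features` and `n_hash` parameter names are borrowed from the vectorizers above.

```python
import hashlib


def bloom_indices(token, n_features=10_000, n_hash=3):
    """Return the set of column indices that a single token switches on."""
    indices = set()
    for seed in range(n_hash):
        # Assumption: a seeded md5 stands in for whatever hash the package uses.
        digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
        indices.add(int.from_bytes(digest[:8], "little") % n_features)
    return indices


def bloom_encode(text, n_features=10_000, n_hash=3):
    """Encode one text as a binary row of length n_features."""
    row = [0] * n_features
    # Assumption: lowercase + whitespace split as a stand-in tokenizer.
    for token in text.lower().split():
        for idx in bloom_indices(token, n_features, n_hash):
            row[idx] = 1
    return row


row = bloom_encode("the quick brown fox", n_features=50, n_hash=3)
print(sum(row))  # number of active columns, at most 4 tokens * 3 hashes
```

With `n_hash=1` this sketch collapses to a single hash per token, which is roughly what a plain hashing vectorizer does.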

Benchmarks

I ran a quick benchmark, which seems to suggest the approach is pretty speedy.

Show me the code
import time
from datasets import load_dataset
from skbloom import BloomVectorizer, BloomishVectorizer, SlowBloomVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

dataset = load_dataset("clinc_oos", "plus")
texts = dataset['train']['text'] * 10

trials = [BloomVectorizer(n_features=10_000), 
          BloomishVectorizer(n_features=10_000), 
          SlowBloomVectorizer(n_features=10_000), 
          HashingVectorizer(n_features=10_000)]

for trial in trials:
    tic = time.time()
    trial.fit_transform(texts)
    toc = time.time()
    print(f"{trial.__class__.__name__}: {toc - tic}")

In this benchmark we're creating a

| Approach | Time taken (s) | Description |
| --- | --- | --- |
| BloomVectorizer | 1.562 | The speedy Rust implementation |
| BloomishVectorizer | 2.111 | Using sklearn's implementation sequentially |
| SlowBloomVectorizer | 5.259 | A pure Python implementation |
| HashingVectorizer | 0.695 | Using sklearn's hashing vectorizer to only hash once |

You can also choose to run the BloomVectorizer with a single hash (n_hash=1), in which case it seems to be competitive with the HashingVectorizer.

Show me the code
import time
from datasets import load_dataset
from skbloom import BloomVectorizer, BloomishVectorizer, SlowBloomVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

dataset = load_dataset("clinc_oos", "plus")
texts = dataset['train']['text'] * 10

for feats in [3000, 5000, 10000, 20000, 100_000]:
    trials = [BloomVectorizer(n_hash=1, n_features=feats), HashingVectorizer(n_features=feats)]
    for trial in trials:
        tic = time.time()
        trial.fit_transform(texts)
        toc = time.time()
        print(f"{feats}: {trial.__class__.__name__}: {toc - tic}")

| Number of feats | BloomVectorizer (s) | HashingVectorizer (s) |
| --- | --- | --- |
| 3000 | 0.6071 | 0.6864 |
| 5000 | 0.6092 | 0.6947 |
| 10000 | 0.6123 | 0.6911 |
| 20000 | 0.6124 | 0.6918 |
| 100000 | 0.6108 | 0.6938 |

I want to be careful about suggesting that the BloomVectorizer is always faster, because the HashingVectorizer comes with way more features. You can build n-gram representations, just to mention one example, which the BloomVectorizer does not do. But it does seem like it is at least competitive, which is neat.

Important

In fairness, while this trick is interesting ... you might be fine just using the HashingVectorizer that comes with sklearn. This project works, but it was also an excuse for me to try out Rust.

It's a nice motivating example for me to learn Rust, partially because it's a tangible example from a field that I am familiar with. But it's also been a relatively low investment to rewrite an expensive bit of code in Rust.

Development

These are mainly some notes for myself.

To install all of this locally:

python -m pip install maturin 
maturin develop
python -m pip install -e .
