Fast and Customizable Tokenizers
Tokenizers
A fast and easy-to-use implementation of today's most used tokenizers.
- High Level design: master
This API is still being stabilized. Breaking changes may land frequently in the coming days and weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install the nightly toolchain with rustup:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, you can do the following:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
"I can feel the magic, can you?",
"The quick brown fox jumps over the lazy dog"
])
print(encoded)
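The ByteLevel pre-tokenizer used above works by remapping every input byte to a printable Unicode character (the scheme introduced by GPT-2), so any text can be represented without unknown tokens. The sketch below is a toy pure-Python re-implementation of that byte-to-character table for intuition only; it is not the library's API, whose actual implementation lives in Rust.

```python
def bytes_to_unicode():
    """Map every byte (0-255) to a printable Unicode character.

    Printable ASCII and two Latin-1 ranges map to themselves; the
    remaining bytes (controls, space, etc.) are shifted up past 255,
    in byte order, so every byte gets a visible, unambiguous character.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), ord("\xff") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

table = bytes_to_unicode()
# Spaces become the visible character "Ġ" (U+0120):
print("".join(table[b] for b in "I can".encode("utf-8")))  # IĠcan
```

This is why byte-level tokens often start with "Ġ": it is the remapped space character, and the matching ByteLevel decoder reverses the mapping to restore the original text.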
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
"./path/to/dataset/1.txt",
"./path/to/dataset/2.txt",
"./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
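Under the hood, BPE training repeatedly counts adjacent symbol pairs across the corpus and merges the most frequent pair everywhere it occurs. The following is a toy pure-Python sketch of that loop for intuition only; the function name is invented and the library implements the real algorithm in Rust.

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy illustration).

    Each word starts as a tuple of single characters; every round we
    count adjacent symbol pairs weighted by word frequency, then merge
    the most frequent pair in every word that contains it.
    """
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = train_toy_bpe(["lower", "lowest", "low", "low"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

In the real trainer, `vocab_size` caps how many symbols end up in the vocabulary and `min_frequency` stops merging once no remaining pair occurs often enough.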
Download files
Download the file for your platform.
Source Distribution
tokenizers-0.0.6.tar.gz (31.2 kB)
Built Distributions
Hashes for tokenizers-0.0.6-cp38-cp38-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | ea01fb5daf48c8fc4677e2c972f3ab9d67de6f1e50e38b0224430e0db287ebec
MD5 | 097fed7b49aadbc35ec87f025ae0b19b
BLAKE2b-256 | c2e30fd7873e5905616628d43e101bcc26013f12f6d6ab5ab15ca6509584b77b
Hashes for tokenizers-0.0.6-cp38-cp38-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 3a4b4e45df54676aea849c10971d000d7eb86be68eaeebfdaf74221d6d99e447
MD5 | dd83dc270d03c80f927c8685daf9e146
BLAKE2b-256 | 03ea6165ea369dc06eaf3aa4dc15db217f8b86010b06af3e925f0ce20b8248eb
Hashes for tokenizers-0.0.6-cp38-cp38-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | fb241dc10061ed3f380dd5a2404e1257e291e12f91c7ee48c52cb5aff74cf516
MD5 | 69430dc6c1d172b6b777ec4e8fe6d574
BLAKE2b-256 | ff135a13f79f6d8009017803c21115fa692a1e98ed1a34ffc4e03be0d4ed0eac
Hashes for tokenizers-0.0.6-cp37-cp37m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 5233d5cb9ff35d90a831431cbe2d7a506cbec8953ffd723fa8b9607819f82377
MD5 | a930d7dfbc99111f08dc9d76b3f90cc4
BLAKE2b-256 | 7d5c75d3232254879a460ac8bbceb5a1b7112b2b192427d574eeac8e5570185a
Hashes for tokenizers-0.0.6-cp37-cp37m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 082be3c04618e9f2e1cccd87e76a2502249e8898690c7a2ba3c3f7b99ded6f31
MD5 | 3d413bc36c57df5614d2076a6409386d
BLAKE2b-256 | 76926ab43d5edee826c02c5279389e103814f499adbe747b4937badef694a3af
Hashes for tokenizers-0.0.6-cp37-cp37m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 2ef3621bbbe864a78b63cf1c2e0706dc24f32dfc413d6c4056a9d1a4e2bc655d
MD5 | f0e57926dc5fc2c181a20776e747a0e6
BLAKE2b-256 | a43d925e69390e1eeaf212f2ddf82e88feb000f4bb6f8245921a9e6e4921bd75
Hashes for tokenizers-0.0.6-cp36-cp36m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 38631afca82d91c667a78a08ad222eaff576e0838b406e32bd82900257c158ba
MD5 | 47cc35e5ed684e74abe194b55e1a718a
BLAKE2b-256 | a7c0034522b4c4f62cb7f580af422f5e1a01033f448c7f3eed702a6d1faacf61
Hashes for tokenizers-0.0.6-cp36-cp36m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | fec9e67e9c5b32db3a3f2cb11980a494dfb5cff1d5cc842400d66bfa6fc821bf
MD5 | 5674e20f1e48bf29b5afd4fe48995f35
BLAKE2b-256 | 34a8ef2b2c87a91122d7dea9be77f28f4012f934dbeaeb0957655aa2cd8a4a4b
Hashes for tokenizers-0.0.6-cp36-cp36m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | c1157f8afb554b65efd02d2b201411170c055810a47225ab4ae756559d6f73a3
MD5 | cc2775f576e3906470da27ed21ab1e2e
BLAKE2b-256 | c039d8caabcc7ea7e3a67b76c19118fd5021a16d4b28cc3d990878a11b69da31
Hashes for tokenizers-0.0.6-cp35-cp35m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | bfb41ff41cbae600f69d9936e144c831a1c40d8d5f616d93220f14aebb299963
MD5 | 1e00cec06437c0bc38aab6e87171f235
BLAKE2b-256 | c55ca42054d2febc634c507a89280154c4ac3b876dba580ca405f1cdeb0f4bfb
Hashes for tokenizers-0.0.6-cp35-cp35m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 5ad69b2919027c0611a8fd7047405fdf7cd13d02c3cccd393eb39605bbb1058e
MD5 | 0297a26a5699fec451d4fa204503ea2d
BLAKE2b-256 | ebf8c204c8f70126475a4a24591f9fd79f6510054ff887efda9a7bf4a2fa1f82
Hashes for tokenizers-0.0.6-cp35-cp35m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | f6314539f2e7ba1f1c8b00dc0fdd17b41cfab684d4b0cfd388305fd556f10101
MD5 | d660f57583e274b5c62cadc5d5d7a3e7
BLAKE2b-256 | 469549b719e0b8a08d838304f597693ca3c613f99cf94b4d0c71661f3cf6bee1