Fast and Customizable Tokenizers
Tokenizers
A fast and easy-to-use implementation of today's most-used tokenizers.
- High-level design: master
This API is currently in the process of being stabilized. Breaking changes may be introduced frequently in the coming days/weeks, so use at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, you can do the following:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
"I can feel the magic, can you?",
"The quick brown fox jumps over the lazy dog"
])
print(encoded)
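As background on what the BPE model loaded above does at encoding time: it splits a word into characters and then greedily applies the merge rules from `merges.txt`, highest-priority first. The following is a minimal pure-Python sketch of that procedure, not this library's implementation; the toy merge list is hypothetical.

```python
# Sketch of how a BPE model applies its learned merge rules.
# Illustration only -- not this library's actual implementation.

def bpe_encode(word, merges):
    """Split `word` into characters, then repeatedly apply the
    highest-priority applicable merge (earlier entries in `merges`
    have higher priority), until no merge applies."""
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, best_i = min(pairs)
        if best_rank == float("inf"):
            break  # no applicable merge left
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols

# Toy merges, ordered by priority (like the lines of a merges.txt)
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))  # ['low', 'er']
```

A real tokenizer layers pre-tokenization (such as the `ByteLevel` step above) before this merge loop, so out-of-vocabulary characters never occur.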
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new())
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
"./path/to/dataset/1.txt",
"./path/to/dataset/2.txt",
"./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
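Conceptually, what a trainer like `BpeTrainer` does is count adjacent symbol pairs across the corpus and repeatedly merge the most frequent one until the vocabulary budget is spent. The following is a minimal pure-Python sketch of that idea, not this library's implementation; the toy word list is hypothetical.

```python
# Sketch of BPE training: repeatedly merge the most frequent adjacent
# symbol pair. Illustration only -- not this library's implementation.
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Represent each word as a tuple of symbols, weighted by frequency
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge everywhere in the corpus
        new_corpus = Counter()
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += count
        corpus = new_corpus
    return merges

words = ["low", "low", "lower", "newest", "newest", "newest"]
print(learn_bpe_merges(words, 3))
```

The real trainer also handles `min_frequency` cut-offs, special tokens, and streams text from the files you pass to `tokenizer.train`, but the core loop is the same pair-counting-and-merging shown here.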
Download files
Download the file for your platform.
Source Distribution
tokenizers-0.0.7.tar.gz (32.6 kB)
Built Distributions
Hashes for tokenizers-0.0.7-cp38-cp38-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 94070b2a4cdef78978d8fd077525876344e98e9bd09274e76b8297768628ddae
MD5 | 29b32588facd73ff061e3a49cc975179
BLAKE2b-256 | 275d95b1f501143bce0577017ba09cc883215fac770286209a5ced2e7f8d97b5
Hashes for tokenizers-0.0.7-cp38-cp38-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | a7d36cd94689ac3dfcbd7d161dd2b42b2398667edd05f25c993fb4057b1b4ecd
MD5 | 6463c42cc778c32d39a48ed7af6171a9
BLAKE2b-256 | 70c016fae2d4e860397191ea63b3fee43b4878882aecbaf85812d9f1dac1b7a4
Hashes for tokenizers-0.0.7-cp38-cp38-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 9f253b1666a5a960eae5a13241acea5cd776f11866a7bc09d5953e1683d6bc3f
MD5 | 549822141905621fd439f20f8887066c
BLAKE2b-256 | 223e267b0d62b214330d171dc79eefd1d2c552b74f03f1d86d48c002ea8dc2b3
Hashes for tokenizers-0.0.7-cp37-cp37m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | da58fd46f9f46a2812f4a4f886b3dec4baa0f64c46f2792a354f659efcc745a2
MD5 | ebbb53466f03b679b481a48248d7665c
BLAKE2b-256 | a69bd0622be89041fb2b1f115fd4ad69ead75280128c116b23806d860611f9c8
Hashes for tokenizers-0.0.7-cp37-cp37m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 5832a856c3d0e3f08f526d2c825ae1d733e931d40372612d6a0b31829b272dca
MD5 | ce355e64d8d1f065e9bd66fb740169e3
BLAKE2b-256 | 06406aa6103718f927273fe94df66d1dc4f0771c7afbac573e16c17f2814607e
Hashes for tokenizers-0.0.7-cp37-cp37m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 662cea8038652353960704576c565506a9a9d78d8d8ce409d235e9799fe9e6d6
MD5 | 4dfee56363a6c21273d2002011a22e1e
BLAKE2b-256 | 9ddff1e2164cd87f6fdf45d5279886c4cd0733345516ad4c15a87a33fdcaea96
Hashes for tokenizers-0.0.7-cp36-cp36m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 2a67858e2d765c7a414dfcd484d699ea18a36e1432a56aedc1dc32c307480473
MD5 | d66a3714c7b9203aa6b3eacd1accc6bd
BLAKE2b-256 | ee7bac6ac6e101294cd862c7df3c95c2c968da2230f7c6d69a350209cd9b11f4
Hashes for tokenizers-0.0.7-cp36-cp36m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | 3518c00e395b436edf874a68b1f74d819283dece070c5ec57d811d9c62e13d0f
MD5 | ce4425657607f533c913603647ca71d7
BLAKE2b-256 | f85d35429295b4f6fa8c7a5c9c2d15a8f3107e7a5112388d9047cb9bcc1d0e92
Hashes for tokenizers-0.0.7-cp36-cp36m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | cd976ca6cf69cce6b15429d6188963509f2a39f609050607b170015a0b02049f
MD5 | 8dc5649268139e71d4f1c6be66c570ca
BLAKE2b-256 | f8f19b05494ddecfc7d885806fbf021cf9546ebc7d2e2e27916aa56eb90bf8ff
Hashes for tokenizers-0.0.7-cp35-cp35m-win_amd64.whl

Algorithm | Hash digest
---|---
SHA256 | 5be48cc7d889fd7f8f66066273669f1df724410de06114c94bb457cb88cc913f
MD5 | e846d4ff68e96c84446be49c69f12201
BLAKE2b-256 | 0827a6005cad6acbe053b836edaceee14bfbe6e49a8d48203853083b7d7cda1b
Hashes for tokenizers-0.0.7-cp35-cp35m-manylinux1_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | ae5960425b9422546bdecb755f43956cc828bbbf6dcd5bcec7a7153080e6c160
MD5 | aca1e61022eea1f1006a9cb38c53b8b7
BLAKE2b-256 | a9a90d9401615d29b066cab3403e7d8395e407c7c69c191bc582e848342fa7ed
Hashes for tokenizers-0.0.7-cp35-cp35m-macosx_10_13_x86_64.whl

Algorithm | Hash digest
---|---
SHA256 | ff6f0b26e5dedfb86c732fbd6aaa8368cfc1df183ca5064b45b75d1382c4dc0d
MD5 | d9ff930784618df8afdea3ecf32f9b80
BLAKE2b-256 | 8ec75b89b4af6f53770f2e6aa73ee79a1eccfc3cdabce7114a79d27efab6c8d3