Fast and Customizable Tokenizers
Project description
Tokenizers
A fast, easy-to-use implementation of today's most widely used tokenizers.
- High Level design: master
This API is still being stabilized. Breaking changes may land frequently over the coming days and weeks, so use it at your own risk.
Installation
With pip:
pip install tokenizers
From sources:
To use this method, you need to have the Rust nightly toolchain installed.
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain nightly-2019-11-01 -y
export PATH="$HOME/.cargo/bin:$PATH"
# Or select the right toolchain:
rustup default nightly-2019-11-01
Once Rust is installed and the right toolchain is selected, run the following:
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python
# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate
# Install `tokenizers` in the current virtual env
pip install maturin
maturin develop --release
Usage
Use a pre-trained tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders
# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)
# Initialize a tokenizer
tokenizer = Tokenizer(bpe)
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
    "I can feel the magic, can you?",
    "The quick brown fox jumps over the lazy dog"
])
print(encoded)
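To give a feel for what a BPE model does with its vocab and merges files, here is a minimal pure-Python sketch of applying merge rules to a single word. It is an illustration only, not the library's implementation: `apply_bpe`, `toy_merges`, and `toy_vocab` are hypothetical names, the merge rules are made up, and byte-level pre-tokenization is left out.

```python
from typing import Dict, List, Tuple


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split `word` into characters, then apply merge rules in priority order."""
    # A lower rank means the merge was learned earlier, so it is applied first.
    ranks: Dict[Tuple[str, str], int] = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Pick the adjacent pair with the best (lowest) rank.
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        # Merge every occurrence of that pair, scanning left to right.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules and vocabulary, for illustration only.
toy_merges = [("m", "a"), ("g", "i"), ("ma", "gi"), ("magi", "c")]
toy_vocab = {"magic": 0, "magi": 1, "ma": 2, "gi": 3,
             "m": 4, "a": 5, "g": 6, "i": 7, "c": 8, "s": 9}

tokens = apply_bpe("magics", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)
```

The order of the merge rules matters: earlier rules take priority, which is why the merges file is an ordered list and not a set.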
Train a new tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())
# Customize pre-tokenization and decoding
tokenizer.with_pre_tokenizer(pre_tokenizers.ByteLevel.new(add_prefix_space=True))
tokenizer.with_decoder(decoders.ByteLevel.new())
# And then train
trainer = trainers.BpeTrainer.new(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
])
# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)
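The roles of `vocab_size` and `min_frequency` can be illustrated with a toy BPE training loop. This is a simplified sketch under stated assumptions, not how `BpeTrainer` is implemented: `train_bpe` is a hypothetical helper, words are split on whitespace, the corpus is a list of in-memory strings where the library reads files, and ties between equally frequent pairs are broken alphabetically.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_bpe(corpus: List[str], vocab_size: int,
              min_frequency: int) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Learn merge rules until the vocab is full or pairs become too rare."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    # The initial vocabulary is every character seen in the corpus.
    vocab = {ch: i for i, ch in enumerate(sorted({c for w in words for c in w}))}
    merges: List[Tuple[str, str]] = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs across all words.
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically.
        best, count = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if count < min_frequency:
            break  # remaining pairs are too rare to be worth merging
        merges.append(best)
        vocab.setdefault(best[0] + best[1], len(vocab))
        # Rewrite every word with the new merged symbol.
        new_words: Counter = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges


# A made-up one-line corpus, for illustration only.
toy_corpus = ["hug hug hug pug pun"]
vocab, merges = train_bpe(toy_corpus, vocab_size=20, min_frequency=2)
print(merges)
print(len(vocab))
```

In this sketch, training stops either when the vocabulary reaches `vocab_size` or when the most frequent remaining pair occurs fewer than `min_frequency` times, whichever comes first.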