A quantization toolkit for PyTorch.

Quanto

DISCLAIMER: this package is still an early prototype (pre-beta version), and not (yet) a HuggingFace product. Expect breaking changes and drastic modifications in scope and features.

🤗 Quanto is a Python quantization toolkit that provides several features that are either not supported or limited by the base PyTorch quantization tools:

all features are available in eager mode (works with non-traceable models),
quantized models can be placed on any device (including CUDA),
automatically inserts quantization and dequantization stubs,
automatically inserts quantized functional operations,
automatically inserts quantized modules (see below the list of supported modules),
provides a seamless workflow from float model to dynamic to static quantized model,
supports quantized model serialization as a state_dict.

Features yet to be implemented:

quantize clone (quantization happens in-place for now),
optimized integer kernels,
quantized operators fusion,
support int4 weights,
compatibility with torch compiler (aka dynamo).

Supported modules

The following modules can be quantized:

Linear (QLinear). Weights are quantized to int8, and biases to int32. Outputs are quantized to int8.

The next modules to be implemented are normalization layers, to allow the quantization of attention blocks:

LayerNorm,
LlamaRMSNorm.

Limitations and design choices

Quanto uses a strict affine quantization scheme with no zero-point: values are mapped to integers using a scale only, so float zero always maps to integer zero.
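
To make the "no zero-point" choice concrete, here is a minimal sketch of scale-only int8 quantization. It is written in plain Python so that it runs without PyTorch, and the helper names `quantize_symmetric` and `dequantize` are illustrative assumptions, not part of Quanto's API.

```python
# Illustrative sketch only: these helpers are hypothetical, not Quanto's API.

def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using a single scale and no zero-point."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats: x is approximated by q * scale."""
    return [qi * scale for qi in q]


weights = [-0.50, 0.0, 0.25, 1.27]
q, scale = quantize_symmetric(weights)   # q == [-50, 0, 25, 127]
restored = dequantize(q, scale)
```

Because there is no zero-point, the integer range is centered on zero, which suits weights and activations that are roughly symmetric but wastes half the range on strictly positive data.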

Quanto does not support mixed-precision quantization.

Although Quanto uses integer activations and weights, the current implementation falls back to float32 operations for integer inputs. No latency benefit should therefore be expected, although weight storage and on-device memory usage should be lower.

Installation

Quanto is available as a pip package.

pip install quanto

Quantization workflow

Quanto does not make a clear distinction between dynamic and static quantization: models are always dynamically quantized, but their weights can later be "frozen" to integer values.

A typical quantization workflow consists of the following steps:

Quantize

The first step converts a standard float model into a dynamically quantized model.

quantize(model)

Calibrate (optional)

Activations are quantized using a default [-1, 1] range, which can lead to severe clipping and/or inaccurate values.

Quanto supports a calibration mode that adjusts the activation ranges while representative samples are passed through the quantized model.

with calibration():
    model(samples)

Note that during calibration, all activations and weights are dequantized and inference happens with float precision.
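
The effect of calibration can be illustrated with a small sketch in plain Python. It assumes a simple "largest magnitude seen" rule for choosing the activation range; the `RangeObserver` class and `quantize_activation` function are hypothetical helpers, and Quanto's actual calibration logic may differ.

```python
# Illustrative sketch only: hypothetical helpers, not Quanto's API.
QMAX = 127  # largest int8 magnitude used by a symmetric scheme


def quantize_activation(x, max_range):
    """Quantize to int8 assuming values lie in [-max_range, max_range]."""
    scale = max_range / QMAX
    q = [max(-QMAX, min(QMAX, round(v / scale))) for v in x]
    return [qi * scale for qi in q]  # dequantized view, for easy comparison


class RangeObserver:
    """Records the largest magnitude seen while samples pass through."""

    def __init__(self):
        self.max_abs = 0.0

    def observe(self, x):
        self.max_abs = max(self.max_abs, max(abs(v) for v in x))


samples = [[0.2, -3.0, 1.5], [4.0, -0.1, 2.5]]

# Default range: everything beyond [-1, 1] is clipped to the range boundary.
clipped = quantize_activation(samples[1], max_range=1.0)

# Calibrated range: observe representative samples, then reuse the maximum.
observer = RangeObserver()
for batch in samples:
    observer.observe(batch)
calibrated = quantize_activation(samples[1], max_range=observer.max_abs)
```

With the default range, the inputs 4.0 and 2.5 both collapse to roughly 1.0, whereas the calibrated range keeps them distinct at the cost of a coarser step size.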

Tune, aka Quantization-Aware-Training (optional)

If the model's accuracy degrades too much after quantization, one can tune it for a few epochs to recover the float model's performance.

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()
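
Training through a rounding step is commonly made possible by the straight-through estimator, which treats the rounding as an identity function in the backward pass. The toy sketch below shows that idea in plain Python with a hand-written gradient; it is an assumption about how such tuning can work in general, not a description of Quanto's internals, and `fake_quantize` is a hypothetical helper.

```python
# Illustrative sketch only: a generic straight-through estimator on one weight.

def fake_quantize(w, scale=0.05, qmax=127):
    """Forward pass: round the float weight to the nearest representable level."""
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


# Toy data for y = 0.37 * x; the float weight starts far from the target.
data = [(x, 0.37 * x) for x in (-2.0, -1.0, 1.0, 2.0)]
w, lr = 1.0, 0.02

for epoch in range(200):
    for x, y in data:
        pred = fake_quantize(w) * x      # forward uses the quantized weight
        grad_pred = 2.0 * (pred - y)     # d(loss)/d(pred) for squared error
        # Straight-through estimator: treat d(fake_quantize)/dw as 1.
        grad_w = grad_pred * x
        w -= lr * grad_w                 # update the underlying float weight

w_q = fake_quantize(w)  # ends on a quantization level next to 0.37
```

The float weight keeps receiving gradient updates even though the forward pass only ever sees values on the quantization grid, which is what lets a few epochs of tuning recover accuracy.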

Freeze integer weights

When freezing a model, its float weights are replaced by quantized integer weights.

freeze(model)
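
Conceptually, freezing trades each float32 weight for an int8 value plus a shared scale. The sketch below uses Python's standard `array` module to show the resulting storage difference; `freeze_weights` and the returned dictionary layout are hypothetical and do not reflect Quanto's actual serialization format.

```python
# Illustrative sketch only: hypothetical helper, not Quanto's API.
from array import array

QMAX = 127


def freeze_weights(float_weights):
    """Replace float weights by int8 values plus one float scale."""
    max_abs = max(abs(w) for w in float_weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = array("b", (max(-QMAX, min(QMAX, round(w / scale))) for w in float_weights))
    return {"data": q, "scale": scale}


# 101 evenly spaced weights in [-1, 1], stored as single-precision floats.
float_weights = array("f", [0.02 * i - 1.0 for i in range(101)])
frozen = freeze_weights(float_weights)

float_bytes = len(float_weights) * float_weights.itemsize
int_bytes = len(frozen["data"]) * frozen["data"].itemsize
print(f"float storage: {float_bytes} bytes, int8 storage: {int_bytes} bytes")
```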

Please refer to the examples for instantiations of that workflow.

Implementation details

Under the hood, Quanto uses a torch.Tensor subclass (QTensor) to dispatch ATen base operations to integer operations.

All integer operations accept QTensor with int8 data.

Most arithmetic operations return a QTensor with int32 data.

In addition to the quantized tensors, Quanto uses quantized modules as substitutes to some base torch modules to:

store quantized weights,
gather input and output scales to rescale QTensor int32 data to int8.
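
The arithmetic described above can be sketched for a single linear layer in plain Python: int8 inputs and weights are multiplied and accumulated as integers, an int32 bias is added, and the result is rescaled to int8 using the input, weight and output scales. All names here (`to_int8`, `qlinear`) and the chosen scales are assumptions for illustration, not Quanto's implementation.

```python
# Illustrative sketch only: hypothetical helpers, not Quanto's API.
QMAX = 127


def to_int8(values, scale):
    """Quantize floats to int8 with a given scale (symmetric, no zero-point)."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def qlinear(x_q, w_q, b_q, in_scale, w_scale, out_scale):
    """int8 inputs and weights, int32 bias and accumulator, int8 output."""
    acc_scale = in_scale * w_scale  # scale of the integer accumulator
    out = []
    for row, bias in zip(w_q, b_q):
        # Python ints are unbounded; on real hardware this would be int32.
        acc = sum(xi * wi for xi, wi in zip(x_q, row)) + bias
        # Rescale from the accumulator scale to the output scale, then clamp.
        out.append(max(-QMAX, min(QMAX, round(acc * acc_scale / out_scale))))
    return out


in_scale, w_scale, out_scale = 0.02, 0.01, 0.02
x = [1.0, -0.5, 0.2]
w = [[0.5, 0.25, -1.0], [1.0, 1.0, 1.0]]
b = [0.1, -0.2]

x_q = to_int8(x, in_scale)
w_q = [to_int8(row, w_scale) for row in w]
# The bias shares the accumulator scale so it can be added before rescaling.
b_q = [round(v / (in_scale * w_scale)) for v in b]

y_q = qlinear(x_q, w_q, b_q, in_scale, w_scale, out_scale)
y = [q * out_scale for q in y_q]  # dequantized outputs, close to x @ w.T + b
```

The float reference for this input is [0.275, 0.5]; the quantized path lands within half an output step of it, which is the rounding error introduced by the final rescale.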

Eventually, the produced quantized graph should be passed to a specific inductor backend to fuse rescale into the previous operation.

Examples of fused operations can be found in https://github.com/Guangxuan-Xiao/torch-int.

Release history

0.2.0: May 24, 2024
0.1.0: Mar 13, 2024
0.0.13: Feb 23, 2024
0.0.12: Feb 16, 2024
0.0.11: Jan 19, 2024
0.0.10: Dec 20, 2023
0.0.9: Dec 15, 2023
0.0.8: Dec 8, 2023
0.0.7: Dec 1, 2023
0.0.6: Oct 27, 2023
0.0.5: Oct 19, 2023
0.0.4: Oct 9, 2023 (this version)
0.0.3: Oct 6, 2023
0.0.2: Oct 6, 2023
0.0.1: Sep 28, 2023
