Boost large language model inference performance on CPU platform.

Project description

xFasterTransformer

xFasterTransformer is an exceptionally optimized solution for large language models (LLM) on the X86 platform, which is similar to FasterTransformer on the GPU platform. xFasterTransformer is able to operate in distributed mode across multiple sockets and nodes to support inference on larger models. Additionally, it provides both C++ and Python APIs, spanning from high-level to low-level interfaces, making it easy to adopt and integrate.

Models overview

Large Language Models (LLMs) are developing rapidly and are increasingly used in many AI scenarios. xFasterTransformer is an optimized solution for inference with popular mainstream LLMs on Xeon. It fully leverages the hardware capabilities of Xeon platforms to achieve high performance and high scalability for LLM inference, both on a single socket and across multiple sockets and nodes.

xFasterTransformer provides a series of APIs in both C++ and Python so that end users can integrate it directly into their own solutions or services. Example code is provided to demonstrate usage, benchmark code and scripts are provided to measure performance, and web demos are provided for popular LLMs.

Support matrix

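| Models | Framework: PyTorch | Framework: C++ | Distribution | DataType: FP16 | DataType: BF16 | DataType: INT8 | DataType: BF16+FP16 | DataType: BF16+INT8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGLM | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| ChatGLM2 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Llama | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Llama2 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Opt | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |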

Documents

The xFasterTransformer documentation provides the following resources:

- An introduction to xFasterTransformer.
- Comprehensive API references for both high-level and low-level interfaces in C++ and PyTorch.
- Practical API usage examples for xFasterTransformer in both C++ and PyTorch.

Installation

From PyPI

pip install xfastertransformer
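
After installing from PyPI, it can be handy to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of xFasterTransformer.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    name = "xfastertransformer"
    print(f"{name} installed: {is_installed(name)}")
```

This only checks that the package can be located; it does not import it or verify that its native libraries load.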

Using Docker

docker pull intel/xfastertransformer:latest

Built from source

Prepare Environment

Manually

oneCCL
- Use the provided script to build it from source.
```
cd 3rdparty
sh prepare_oneccl.sh
source ./oneCCL/build/_install/env/setvars.sh
```
- Or install oneCCL by installing the Intel® oneAPI Base Toolkit.
PyTorch v2.0+ (required when using the PyTorch API; not needed when using the C++ API)
```
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

Docker (Recommended)

Pull the Docker image from Docker Hub

docker pull intel/xfastertransformer:dev-ubuntu22.04

Build docker image from Dockerfile

docker build \
-f dockerfiles/Dockerfile \
--build-arg "HTTP_PROXY=${http_proxy}" \
--build-arg "HTTPS_PROXY=${https_proxy}" \
-t intel/xfastertransformer:dev-ubuntu22.04 .

Then run the container with either the bash script in the repo or the docker command below (assuming the model files are in the /data/ directory):

# A new image will be created to ensure both the user and file directories are consistent with the host if the user is not root.
bash run_dev_docker.sh

# Or run Docker manually with the following command.
docker run -it \
    --name xfastertransformer-dev \
    --privileged \
    --shm-size=16g \
    -v "${PWD}":/root/xfastertransformer \
    -v /data/:/data/ \
    -w /root/xfastertransformer \
    -e "http_proxy=$http_proxy" \
    -e "https_proxy=$https_proxy" \
    intel/xfastertransformer:dev-ubuntu22.04

Note: Increase --shm-size if a bus error occurs while running in multi-rank mode. By default, Docker limits shared memory to 64 MB, and our implementation relies heavily on shared memory for better performance.
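
To catch an undersized shared-memory mount before a multi-rank run fails with a bus error, a quick check inside the container can help. The sketch below is an assumption-laden illustration: the function names are hypothetical, `/dev/shm` is assumed to be the shared-memory mount (as is usual on Linux), and the 16 GiB threshold simply mirrors the `--shm-size=16g` value from the command above rather than a documented minimum.

```python
import shutil

GIB = 1024 ** 3


def shm_is_sufficient(total_bytes: int, required_gib: float = 16.0) -> bool:
    """Return True if the shared-memory size meets the chosen threshold."""
    return total_bytes >= required_gib * GIB


def check_shm(path: str = "/dev/shm", required_gib: float = 16.0) -> bool:
    """Measure the shared-memory mount at `path` and compare it to the threshold."""
    total = shutil.disk_usage(path).total
    return shm_is_sufficient(total, required_gib)
```

With Docker's 64 MB default, `shm_is_sufficient(64 * 1024 ** 2)` returns `False`, which is the situation the note warns about.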

How to build

Using 'CMake'

# Build xFasterTransformer
git clone https://github.com/intel/xFasterTransformer.git xFasterTransformer
cd xFasterTransformer
# Please make sure torch is installed if you plan to run the Python examples
mkdir build && cd build
cmake ..
make -j

Using 'python setup.py'

# Build xFasterTransformer library and C++ example.
python setup.py build

# Install xFasterTransformer into pip environment.
python setup.py install

Models Preparation

xFasterTransformer uses a model format that differs from Hugging Face's but is compatible with FasterTransformer's format.

First, download the model in Hugging Face format.
Then convert it to the xFasterTransformer format using the script in the 'tools' folder. The output directory will contain many bin files.

    python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o ${OUTPUT_DIR}
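
If the conversion needs to be driven from Python, for instance as part of a setup script, the same command can be assembled and run with `subprocess`. This is a sketch only: the helper names are hypothetical, the script path and the `-i`/`-o` flags are taken from the command above, and listing `*.bin` files afterwards assumes the converter writes them directly into the output directory.

```python
import subprocess
import sys
from pathlib import Path


def build_convert_command(hf_model_dir, output_dir, script="./tools/chatglm_convert.py"):
    """Build the argument list for the conversion script shown above."""
    return [sys.executable, script, "-i", str(hf_model_dir), "-o", str(output_dir)]


def convert_model(hf_model_dir, output_dir):
    """Run the conversion and return the names of the .bin files it produced."""
    subprocess.run(build_convert_command(hf_model_dir, output_dir), check=True)
    # Assumption: the converted weights land directly in output_dir as *.bin files.
    return sorted(p.name for p in Path(output_dir).glob("*.bin"))
```

Run it from the repository root so that the relative script path resolves, for example `convert_model("/data/chatglm-6b-hf", "/data/chatglm-6b-cpu")`.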

API usage

For more details, please see the API documentation and examples.

Python API (PyTorch)

First, install the dependencies.

Python dependencies
```
pip install -r requirements.txt
```
oneCCL
Install oneCCL and set up the environment. Please refer to Prepare Environment.

xFasterTransformer's Python API is similar to that of transformers and also supports the transformers streamer for streaming output. In the example, we use transformers to encode input prompts into token ids.

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-cpu`.
MODEL_PATH="/data/chatglm-6b-cpu"
TOKEN_PATH="/data/chatglm-6b-hf"

INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)

input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)
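
When streaming is not needed, the generated ids can be turned back into text with the same tokenizer. The helper below is a small sketch with a hypothetical name; it assumes `generated_ids` is a batch of token-id sequences, which is what `batch_decode` in transformers expects, and that has not been verified against xFasterTransformer's return type here.

```python
def decode_generated(tokenizer, generated_ids):
    """Decode a batch of generated token ids into a list of strings."""
    # batch_decode is the transformers tokenizer method for decoding several sequences.
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
```

With the objects from the example above, `print(decode_generated(tokenizer, generated_ids)[0])` would print the text for the first (and only) prompt.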

C++ API

SentencePiece can be used to tokenize and detokenize text.

#include <vector>
#include <iostream>
#include "xfastertransformer.h"

int main() {
    // ChatGLM token ids for prompt "Once upon a time, there existed a little girl who liked to have adventures."
    std::vector<int> input(
            {3393, 955, 104, 163, 6, 173, 9166, 104, 486, 2511, 172, 7599, 103, 127, 17163, 7, 130001, 130004});

    // Assume converted model dir is `/data/chatglm-6b-cpu`.
    xft::AutoModel model("/data/chatglm-6b-cpu", xft::DataType::bf16);

    model.config(/*max length*/ 100, /*num beams*/ 1);
    model.input(/*input token ids*/ input, /*batch size*/ 1);

    // Generate step by step until the model reports that it is done.
    while (!model.isDone()) {
        std::vector<int> nextIds = model.generate();
    }

    // Collect the full result and print the token ids.
    std::vector<int> result = model.finalize();
    for (auto id : result) {
        std::cout << id << " ";
    }
    std::cout << std::endl;
    return 0;
}

How to run

Preloading libiomp5.so is recommended for better performance. After xFasterTransformer is built successfully, the libiomp5.so file is located in the 3rdparty/mklml/lib directory.

Single rank

Setting the SINGLE_INSTANCE=1 environment variable is recommended to avoid MPI initialization.
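
When a single-rank workload is launched from another Python process, both recommendations can be applied by preparing the child's environment. This is a minimal sketch; `single_rank_env` is a hypothetical helper, and the default `libiomp5.so` value assumes the dynamic loader can find the library (otherwise pass its full path in `3rdparty/mklml/lib`).

```python
import os


def single_rank_env(base_env=None, iomp_path="libiomp5.so"):
    """Return a copy of the environment with SINGLE_INSTANCE and LD_PRELOAD set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SINGLE_INSTANCE"] = "1"
    # Keep any libraries that were already being preloaded.
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{iomp_path}:{existing}" if existing else iomp_path
    return env
```

The result can be passed as `env=` to `subprocess.run`, for example `subprocess.run([sys.executable, "my_script.py"], env=single_rank_env())`, where `my_script.py` stands in for your own workload.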

Multi ranks

Command line

Use MPI to run in multi-rank mode. Here is an example on a local machine. Install oneCCL first; please refer to Prepare Environment.

OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
  -n 1 numactl -N 0  -m 0 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 1  -m 1 ${RUN_WORKLOAD}
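
The command above follows a regular pattern: one rank per NUMA node, each pinned with `numactl` and separated by `:`. As a sketch of how that pattern extends to more nodes, the hypothetical helper below builds the argument list; it assumes rank i is bound to node i, as in the two-rank example.

```python
import shlex


def build_mpirun_command(workload, num_ranks=2):
    """Build an mpirun argument list with one numactl-pinned rank per NUMA node."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    cmd = ["mpirun"]
    for rank in range(num_ranks):
        if rank > 0:
            cmd.append(":")  # mpirun separates per-rank program specs with ':'
        cmd += ["-n", "1", "numactl", "-N", str(rank), "-m", str(rank), *workload]
    return cmd


if __name__ == "__main__":
    # "demo.py" is a placeholder for your own workload.
    print(shlex.join(build_mpirun_command(["python", "demo.py"], num_ranks=2)))
```

`OMP_NUM_THREADS` and `LD_PRELOAD` still need to be set in the environment of the launched command, as in the shell example.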

Code

For more details, please refer to examples.

Python

model.rank returns the process's rank; the process with model.rank == 0 is the Master.
For Slaves, the only thing to do after loading the model is to call model.generate(). The input and generation configuration are synced automatically.

model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")

# Slave
while True:
    model.generate()
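
Since every rank runs the same script, the Master and Slave paths are typically selected by branching on `model.rank`. The following sketch shows one way to structure that; `run_rank`, `handle_requests` and `max_worker_steps` are hypothetical names introduced here, and the step limit exists only so the loop can be exercised in isolation.

```python
def run_rank(model, handle_requests, max_worker_steps=None):
    """Run the Master callback on rank 0; otherwise loop on model.generate()."""
    if model.rank == 0:
        # The Master prepares inputs and drives generation.
        return handle_requests(model)
    steps = 0
    # Slaves only call generate(); inputs and config are synced from the Master.
    while max_worker_steps is None or steps < max_worker_steps:
        model.generate()
        steps += 1
    return None
```

In a real run `max_worker_steps` would be left as `None`, so Slave ranks keep serving until the process is stopped.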

C++

model.getRank() returns the process's rank; the process with model.getRank() == 0 is the Master.
For Slaves, any values can be passed to model.config() and model.input(), since the Master's values will be synced.

xft::AutoModel model("/data/chatglm-6b-cpu", xft::DataType::bf16);

// Slave
while (1) {
    model.config();
    std::vector<int> input_ids;
    model.input(/*input token ids*/ input_ids, /*batch size*/ 1);

    while (!model.isDone()) {
        model.generate();
    }
}

Web Demo

A web demo based on Gradio is provided in the repo. It currently supports the ChatGLM, ChatGLM2 and Llama2 models.

Prepare the model.

Install the dependencies

pip install -r examples/web_demo/requirements.txt

Run the script corresponding to the model. After the web server starts, open the output URL in a browser to use the demo. Please specify the model and tokenizer directory paths and the data type. The transformers tokenizer is used to encode and decode text, so ${TOKEN_PATH} refers to the Hugging Face model directory. This demo also supports multi-rank mode.

# Preloading `libiomp5.so` is recommended for better performance.
# The `libiomp5.so` file will be in the `3rdparty/mklml/lib` directory after building xFasterTransformer.
SINGLE_INSTANCE=1 LD_PRELOAD=libiomp5.so python examples/web_demo/ChatGLM.py \
                                                     --dtype=bf16 \
                                                     --token_path=${TOKEN_PATH} \
                                                     --model_path=${MODEL_PATH}

Benchmark

Benchmark scripts are provided to get the model inference performance quickly.

Prepare the model.
Enter the folder corresponding to the model, for example
```
cd benchmark/chatglm6b/
```
Run the script run_${MODEL}.sh. Please modify the model and tokenizer paths in ${MODEL}.sh before running.
- The shell script automatically checks the number of NUMA nodes. By default, it assumes at least 2 nodes with 48 physical cores per node (if the system is in sub-NUMA mode, 12 cores per sub-NUMA node).
- By default, you will get the performance for "input tokens=32, output tokens=32, Beam_width=1, FP16".
- If you need performance for other data types or scenarios, please modify the parameters in ${MODEL}.sh.
- If the system configuration needs modification, please change run-chatglm-6b.sh.
- To use custom input, please modify the prompt_pool.json file.

Note: System and CPU configurations vary. For the best performance, try adjusting OMP_NUM_THREADS, the data type, and the number of memory nodes (check the memory nodes using numactl -H) to suit your test environment.
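
To pick a starting value for OMP_NUM_THREADS, the node count reported by `numactl -H` can be combined with the number of physical cores. The sketch below uses hypothetical helper names and assumes the first line of the `numactl -H` output has the usual `available: N nodes (...)` form; an even split of cores across nodes is only a starting point, not a tuned setting.

```python
import re


def parse_numa_node_count(numactl_output: str) -> int:
    """Extract the node count from the 'available: N nodes' line of `numactl -H`."""
    match = re.search(r"available:\s*(\d+)\s+nodes", numactl_output)
    if match is None:
        raise ValueError("could not find 'available: N nodes' in the output")
    return int(match.group(1))


def threads_per_rank(physical_cores: int, numa_nodes: int) -> int:
    """Split the physical cores evenly across NUMA nodes (one rank per node)."""
    if numa_nodes < 1:
        raise ValueError("numa_nodes must be at least 1")
    return physical_cores // numa_nodes
```

For the default case described above, 96 physical cores on 2 nodes gives 48 threads per rank; a single 48-core socket split into 4 sub-NUMA nodes gives 12.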

Release history

1.6.0

Apr 28, 2024

1.5.0

Apr 15, 2024

1.4.0

Mar 8, 2024

1.3.1

Jan 24, 2024

1.3.0

Jan 23, 2024

1.2.0

Dec 22, 2023

1.1.0

Dec 1, 2023

This version

1.0.0

Oct 17, 2023

0.0.0.dev0 pre-release yanked

Oct 12, 2023

Reason this release was yanked:

This was just a test upload to create the project

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

xfastertransformer-1.0.0-py3-none-any.whl (20.6 MB view hashes)

Uploaded Oct 17, 2023 Python 3

Hashes for xfastertransformer-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e55d1828f4262c1f5f6831330a60a0fb9a341730a440ddda0656d0a73b6e0e4`
MD5	`c42b964d479ac449ad2a845f71530dd7`
BLAKE2b-256	`c3f79cf07f9c60b8fbf9197f9aba12909d625db8722a6dd4565c80090b970f3c`
