Python llama.cpp HTTP Server and LangChain LLM Client
Python HTTP Server and LangChain LLM Client for llama.cpp.
The server has only two routes:
- call: POST /api/1.0/text/completion, which returns the whole text completion for a prompt at once
- stream: GET /api/1.0/text/completion, which streams text chunks for a prompt over a WebSocket
The LangChain LLM client supports synchronous calls only, built on the Python packages requests and websockets.
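To give a rough idea of the call route, here is a minimal sketch using requests. The host, port, and JSON payload fields are assumptions (this README does not specify them); see misc/example_client_call.py for the exact request schema.

```python
# Minimal sketch of the blocking "call" route.
# ASSUMPTIONS: server on localhost:5000, JSON body with a "prompt" field.
import requests

resp = requests.post(
    "http://localhost:5000/api/1.0/text/completion",
    json={"prompt": "Building a website can be done in 10 simple steps:"},
    timeout=300,  # completions can take a while on CPU
)
resp.raise_for_status()
print(resp.json())
```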
Install
```bash
pip install llama_cpp_http
```
Manual install
This assumes that the GPU driver and OpenCL/CUDA libraries are already installed.
Make sure you follow the instructions from LLAMA_CPP.md below for one of the following:
- CPU - including Apple; recommended for beginners
- OpenCL for AMDGPU/NVIDIA (CLBlast)
- HIP/ROCm for AMDGPU (hipBLAS)
- CUDA for NVIDIA (cuBLAS)
It is easiest to start with the CPU-only build of llama.cpp if you do not want to deal with GPU drivers and libraries.
Install build packages
- Arch/Manjaro: `sudo pacman -Sy base-devel python git jq`
- Debian/Ubuntu: `sudo apt install build-essential python3-dev python3-venv python3-pip libffi-dev libssl-dev git jq`
Clone repo
```bash
git clone https://github.com/mtasic85/python-llama-cpp-http.git
cd python-llama-cpp-http
```
Make sure you are inside the cloned repository directory python-llama-cpp-http.
Setup python venv
```bash
python -m venv venv
source venv/bin/activate
python -m ensurepip --upgrade
pip install -U .
```
Clone and compile llama.cpp
```bash
git clone https://github.com/ggerganov/llama.cpp llama.cpp
cd llama.cpp
make -j
```
Download Meta's Llama 2 7B Model
Download a GGUF model from https://huggingface.co/TheBloke/Llama-2-7B-GGUF to the local directory models.
We advise starting with https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q2_K.gguf because it has minimal requirements and fits in both RAM and VRAM.
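If you prefer to script the download instead of using the browser, one option (not part of this project) is the huggingface_hub package:

```python
# Optional: fetch the GGUF model programmatically.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q2_K.gguf",
    local_dir="models",  # matches the --models-path used below
)
```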
Run Server
```bash
python -m llama_cpp_http.server --backend cpu --models-path ./models --llama-cpp-path ./llama.cpp
```
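Before running the client examples, you may want to confirm the server is accepting connections. The sketch below polls a TCP port; the port number is an assumption (this README does not state the default), so adjust it to whatever your server reports on startup.

```python
# Poll until the server accepts TCP connections before running clients.
# ASSUMPTION: server listens on localhost:5000; adjust to your setup.
import socket
import time

def wait_for_server(host: str = "localhost", port: int = 5000, timeout: float = 30.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

print("server up" if wait_for_server() else "server not reachable")
```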
Run Client Examples
- Simple text completion call (POST /api/1.0/text/completion):
  `python -B misc/example_client_call.py | jq .`
- WebSocket stream (GET /api/1.0/text/completion), with a minimal hand-rolled client sketched below:
  `python -B misc/example_client_stream.py | jq -R '. as $line | try (fromjson) catch $line'`
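For reference, a bare-bones streaming client might look like this, using the asyncio API of the websockets package. The host, port, and message format are assumptions; misc/example_client_stream.py is the authoritative example.

```python
# Minimal streaming sketch for the WebSocket route.
# ASSUMPTIONS: server on localhost:5000, JSON request with a "prompt" field,
# and one completion chunk per WebSocket message.
import asyncio
import json

import websockets

async def main() -> None:
    uri = "ws://localhost:5000/api/1.0/text/completion"
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps({"prompt": "Explain WebSockets in one sentence:"}))
        async for message in ws:  # iteration ends when the server closes
            print(message)

asyncio.run(main())
```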
Licensing
python-llama-cpp-http is licensed under the MIT license. See the LICENSE file for details.