Text Generation Inference

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-access LLMs. Its features include quantization, tensor parallelism, token streaming, continuous batching, flash attention, guidance, and more.

The easiest way to get started with TGI is to use the official Docker container.

Deploying

model=mistralai/Mistral-7B-Instruct-v0.3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    ghcr.io/huggingface/text-generation-inference:2.0.3 \
    --model-id $model

This will spawn a TGI instance exposing an OpenAI-like API, as documented in the API section.
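
To verify that the server is up before wiring in a client, you can query TGI's info route, which returns metadata about the deployed model. A minimal check, assuming the container above is running on localhost:8080 and the requests package is installed:

import requests

# /info returns metadata about the deployed model (model id, dtype, token limits, ...)
response = requests.get("http://127.0.0.1:8080/info", timeout=10)
print(response.json())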

Make sure to set the HUGGING_FACE_HUB_TOKEN environment variable to your Hugging Face user access token. To use Mistral models, you must first visit the corresponding model page and fill out the short access form; you will then be granted access automatically.

If the model does not fit on your GPU, you can also use a quantization method such as AWQ or GPTQ (selected with the launcher's --quantize option). You can find all TGI launch options in its documentation.

Using the API

With chat-compatible endpoint

TGI supports the Messages API, which is compatible with the Mistral and OpenAI Chat Completions APIs.

from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

# init the client but point it to TGI
client = MistralClient(api_key="-", endpoint="http://127.0.0.1:8080")
chat_response = client.chat(
    model="-",
    messages=[
        ChatMessage(role="user", content="What is the best French cheese?")
    ],
)

print(chat_response.choices[0].message.content)
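
Because the endpoint follows the OpenAI Chat Completions format, the same server can also be queried with the openai Python client. A minimal sketch, assuming the container above is running on localhost:8080 and the openai package is installed; a single-model TGI deployment generally accepts any placeholder model name, so "tgi" is used here as an illustrative value:

from openai import OpenAI

# Point the OpenAI client at the local TGI instance; the api_key is unused but required.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="-")

chat_response = client.chat.completions.create(
    model="tgi",  # placeholder; a single-model TGI server does not need a real model name
    messages=[
        {"role": "user", "content": "What is the best French cheese?"}
    ],
)

print(chat_response.choices[0].message.content)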

Using the generate endpoint

If you want more control over what you send to the server, you can use the generate endpoint. In this case, you're responsible for formatting the prompt with the correct template and stop tokens.

# Make sure to install the huggingface_hub package beforehand
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")
print(client.text_generation(prompt="What is Deep Learning?"))
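
One way to build the prompt with the correct template is to reuse the model's own chat template. A sketch, assuming the transformers library is installed and your Hugging Face token has access to the gated Mistral repository; max_new_tokens=256 is an illustrative value:

from huggingface_hub import InferenceClient
from transformers import AutoTokenizer

# Load the tokenizer of the deployed model to reuse its chat template.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

messages = [{"role": "user", "content": "What is the best French cheese?"}]

# Render the messages into the model's instruction format (e.g. [INST] ... [/INST]).
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

client = InferenceClient(model="http://127.0.0.1:8080")
print(client.text_generation(prompt=prompt, max_new_tokens=256))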