Skip to main content

vLLM

vLLM can be deployed using a docker image we provide, or directly from the python package.

info

If you are deploying a given model for the first time, you will first need to go to the model's card page on the HuggingFace website then accept the conditions of access.

This is a one-time operation for each model and does not affect their license terms.

With docker

On a GPU-enabled host, you can run the Mistral AI LLM Inference image with the following command to download the model from Hugging Face:

docker run --gpus all \
-e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
ghcr.io/mistralai/mistral-src/vllm:latest \
--host 0.0.0.0 \
--model mistralai/Mistral-7B-Instruct-v0.2

Where HF_TOKEN is an environment variable containing your Hugging Face user access token. This will spawn a vLLM instance exposing an OpenAI-like API, as documented in the API section.

info

If your GPU has CUDA capabilities below 8.0, you will see the error ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your XXX GPU has compute capability 7.0. You need to pass the parameter --dtype half to the Docker command line.

The dockerfile for this image can be found on our reference implementation github.

Without docker

Alternatively, you can directly spawn a vLLM server on a GPU-enabled host with Cuda 11.8.

Install vLLM

Firstly you need to install vLLM (or use conda add vllm if you are using Anaconda):

pip install vllm

Log in to the Hugging Face hub

You will also need to log in to the Hugging Face hub using:

huggingface-cli login

Run the OpenAI compatible inference endpoint

You can then use the following command to start the server:

python -u -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model mistralai/Mistral-7B-Instruct-v0.2