[How To] Deploy AI Models with Docker: Linux Guide
Deploy AI models with Docker on Linux to achieve environment consistency, rapid scaling, and seamless GPU integration. As AI workloads become increasingly complex in 2026, containerization has evolved from a convenience to a necessity for developers and system administrators. By encapsulating dependencies such as CUDA libraries, Python runtimes, and model weights within a Docker image, you eliminate the “it works on my machine” problem and simplify deployment across diverse Linux distributions. This tutorial provides a step-by-step walkthrough for setting up your environment and running state-of-the-art models using industry-standard tools.
Table of Contents
- Prerequisites to Deploy AI Models with Docker
- Installing the NVIDIA Container Toolkit
- Configuring Docker for GPU Acceleration
- Deploy AI Models with Docker using Ollama
- Deploy AI Models with Docker for Hugging Face
- Best Practices for AI Container Deployment
- Conclusion
Prerequisites to Deploy AI Models with Docker
Before you deploy AI models with Docker on Linux, ensure your host system meets the following technical requirements to support heavy inference workloads. While Docker can run on many distributions, we recommend using Ubuntu 24.04 LTS for the best compatibility with NVIDIA drivers.
- NVIDIA GPU: A modern GPU with at least 8GB of VRAM (RTX 30-series or newer is recommended for 2026 models).
- NVIDIA Drivers: Version 550 or higher installed on the host. You can check this by running nvidia-smi.
- Docker Engine: Version 24.0 or later. For more information, visit the official Docker installation guide. Avoid Snap versions of Docker, as they often have issues accessing GPU hardware.
- Background Knowledge: Familiarity with basic CUDA setup on Ubuntu will help you troubleshoot driver-related issues.
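Before moving on, you can sanity-check these prerequisites from a terminal. The following sketch only verifies that the required tools are on your PATH and reports their versions; it does not validate the exact driver or engine version thresholds, so adjust it to your setup:

```shell
#!/usr/bin/env bash
# Check that the NVIDIA driver tools and Docker Engine are installed,
# recording anything missing so we can report it all at once.
missing=""
command -v nvidia-smi >/dev/null 2>&1 || missing="$missing nvidia-smi"
command -v docker     >/dev/null 2>&1 || missing="$missing docker"

if [ -z "$missing" ]; then
    # Report GPU name, driver version, and VRAM, plus the Docker version.
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
    docker --version
else
    echo "Missing prerequisites:$missing"
fi
```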
Installing the NVIDIA Container Toolkit
The NVIDIA Container Toolkit is the bridge that allows Docker containers to communicate directly with your physical GPU. Without this toolkit, your containers would be restricted to CPU-only inference, which is significantly slower for most AI tasks.
Step 1: Configure the Repository
First, we need to add the NVIDIA package repositories to our system so that apt can find the necessary tools.
root@ubuntu:~$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Step 2: Install the Toolkit
Update your local package index and install the toolkit package. This installation includes the nvidia-ctk utility used for configuration.
root@ubuntu:~$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
Configuring Docker for GPU Acceleration
Once the toolkit is installed, you must configure the Docker daemon to recognize the NVIDIA runtime. This ensures that when you use the --gpus flag, Docker knows how to pass the hardware through to the container.
root@ubuntu:~$ sudo nvidia-ctk runtime configure --runtime=docker
root@ubuntu:~$ sudo systemctl restart docker
To verify that everything is working correctly, run a test container using the official NVIDIA CUDA image. If successful, you should see the GPU status table within your terminal.
root@ubuntu:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Deploy AI Models with Docker using Ollama
Ollama has become the standard for running Large Language Models (LLMs) locally. Using Docker to install and configure Ollama provides an isolated environment that keeps your host system clean while giving you access to models like Llama 3 or DeepSeek.
Use the following command to start the Ollama server with full GPU support and persistent storage for your models. You can find more models in the Ollama model library.
root@ubuntu:~$ docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
After the container is running, you can “pull” and run a specific model, such as DeepSeek R1, by executing a command inside the container:
root@ubuntu:~$ docker exec -it ollama ollama run deepseek-r1:7b
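Besides the interactive session, the Ollama container exposes a REST API on the published port 11434, which is how other applications on your server would integrate with it. As a quick sketch (assuming the container above is running and the deepseek-r1:7b model has been pulled), you can send a generation request with curl:

```shell
#!/usr/bin/env bash
# Build a non-streaming generation request for Ollama's /api/generate
# endpoint. The model name must match one you have already pulled.
payload='{"model": "deepseek-r1:7b", "prompt": "Why is the sky blue?", "stream": false}'

# Send it to the Ollama server published on port 11434; fall back to a
# message if the container is not running on this host.
curl -s http://localhost:11434/api/generate -d "$payload" \
  || echo "Ollama server not reachable on port 11434"
```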
Deploy AI Models with Docker for Hugging Face
For production environments where you need to deploy Hugging Face transformers as a scalable API, tools like vLLM or Text Generation Inference (TGI) are preferred. These tools are often distributed as Docker images optimized for high throughput.
The following example demonstrates how to serve a Hugging Face model using the vLLM container. Note the use of the HUGGING_FACE_HUB_TOKEN environment variable for gated models.
root@ubuntu:~$ docker run -d --name vllm-server \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --env HUGGING_FACE_HUB_TOKEN="your_token_here" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
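Because vLLM serves an OpenAI-compatible API, you can verify the deployment with a plain curl request once the model has finished loading. This sketch assumes the vllm-server container above is running on port 8000 and that the model name matches the --model flag you passed:

```shell
#!/usr/bin/env bash
# Build a minimal chat completion request in the OpenAI API format.
# The "model" field must match the --model value given to vLLM.
payload='{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

# POST it to vLLM's OpenAI-compatible endpoint; report if the server
# is not up yet (model loading can take a few minutes).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" \
  || echo "vLLM server not reachable on port 8000"
```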
Best Practices for AI Container Deployment
Managing AI workloads requires a different approach than traditional web services due to the heavy resource demands. Follow these best practices to ensure stability and performance.
- Volume Mounting: Always mount a host directory or Docker volume to /root/.cache or /root/.ollama. AI models are massive, and you don’t want to re-download 10GB+ files every time you restart a container.
- Resource Limits: Use Docker’s --memory and --cpus flags to prevent a single runaway inference process from crashing your entire Linux server.
- Image Versioning: Never use the :latest tag for production. Explicitly version your base images (e.g., nvidia/cuda:12.6.0-base-ubuntu24.04) to ensure reproducible builds.
- Monitoring: Keep an eye on VRAM usage. If the GPU runs out of VRAM, containers may crash with “Out of Memory” (OOM) errors that are difficult to debug. Consult our guide on AI distributions for tools that help monitor GPU health.
Conclusion
When you deploy AI models with Docker on Linux, you leverage the full power of containerization to build a robust AI infrastructure. By correctly setting up the NVIDIA Container Toolkit and utilizing optimized images from Ollama or Hugging Face, you can transform a standard Linux server into a high-performance AI inference engine. As you move forward, consider exploring orchestration tools like Docker Compose or Kubernetes to manage multiple model instances and scale your AI capabilities further.
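As a starting point for that next step, the Ollama deployment from this guide can be expressed as a minimal Docker Compose file. This is a sketch: the GPU reservation syntax shown requires Docker Compose v2 with the NVIDIA Container Toolkit already configured as described above.

```shell
#!/usr/bin/env bash
# Write a minimal compose file equivalent to the earlier `docker run`
# command: published port, persistent model volume, and a GPU
# reservation handled through the NVIDIA runtime.
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
EOF

# Bring the service up in the background.
docker compose up -d \
  || echo "docker compose failed (is Docker Engine with the Compose plugin installed?)"
```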