[How To] Deploy AI Models with Docker: Linux Guide

Deploy AI models with Docker on Linux to achieve environment consistency, rapid scaling, and seamless GPU integration. As AI workloads become increasingly complex in 2026, containerization has evolved from a convenience to a necessity for developers and system administrators. By encapsulating dependencies such as CUDA libraries, Python runtimes, and model weights within a Docker image, you eliminate the “it works on my machine” problem and simplify deployment across diverse Linux distributions. This tutorial provides a step-by-step walkthrough for setting up your environment and running state-of-the-art models using industry-standard tools.

Prerequisites to Deploy AI Models with Docker

Before you deploy AI models with Docker on Linux, ensure your host system meets the following technical requirements to support heavy inference workloads. While Docker can run on many distributions, we recommend using Ubuntu 24.04 LTS for the best compatibility with NVIDIA drivers.

  • NVIDIA GPU: A modern GPU with at least 8GB of VRAM (RTX 30-series or newer is recommended for 2026 models).
  • NVIDIA Drivers: Version 550 or higher installed on the host. You can check this using nvidia-smi.
  • Docker Engine: Version 24.0 or later. For more information, visit the official Docker installation guide. Avoid using Snap versions of Docker as they often have issues accessing GPU hardware.
  • Background Knowledge: Familiarity with basic CUDA setup on Ubuntu will help you troubleshoot driver-related issues.
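You can confirm each of these prerequisites from the shell before continuing. The version numbers in the comments below are simply the minimums from the list above:

```shell
# Check the NVIDIA driver version on the host (should be 550 or higher)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check the Docker Engine version (should be 24.0 or later)
docker version --format '{{.Server.Version}}'

# Confirm Docker was not installed via Snap (no output means you are fine)
snap list 2>/dev/null | grep -i docker
```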

Installing the NVIDIA Container Toolkit

The NVIDIA Container Toolkit is the bridge that allows Docker containers to communicate directly with your physical GPU. Without this toolkit, your containers would be restricted to CPU-only inference, which is significantly slower for most AI tasks.

Step 1: Configure the Repository

First, we need to add the NVIDIA package repositories to our system so that apt can find the necessary tools.

root@ubuntu:~$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Step 2: Install the Toolkit

Update your local package index and install the toolkit package. This installation includes the nvidia-ctk utility used for configuration.

root@ubuntu:~$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

Configuring Docker for GPU Acceleration

Once the toolkit is installed, you must configure the Docker daemon to recognize the NVIDIA runtime. This ensures that when you use the --gpus flag, Docker knows how to pass the hardware through to the container.

root@ubuntu:~$ sudo nvidia-ctk runtime configure --runtime=docker
root@ubuntu:~$ sudo systemctl restart docker
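Before pulling a test image, you can check that the daemon actually registered the NVIDIA runtime:

```shell
# List the runtimes known to the Docker daemon; "nvidia" should appear
docker info --format '{{json .Runtimes}}'

# The configure step above writes the runtime entry here
cat /etc/docker/daemon.json
```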

To verify that everything is working correctly, run a test container using the official NVIDIA CUDA image. If successful, you should see the GPU status table within your terminal.

root@ubuntu:~$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Deploy AI Models with Docker using Ollama

Ollama has become the standard for running Large Language Models (LLMs) locally. Using Docker to install and configure Ollama provides an isolated environment that keeps your host system clean while giving you access to models like Llama 3 or DeepSeek.

Use the following command to start the Ollama server with full GPU support and persistent storage for your models. You can find more models in the Ollama model library.

root@ubuntu:~$ docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

After the container is running, you can “pull” and run a specific model, such as DeepSeek R1, by executing a command inside the container:

root@ubuntu:~$ docker exec -it ollama ollama run deepseek-r1:7b
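Because the container publishes port 11434, you can also talk to Ollama over its REST API from the host instead of an interactive session. A minimal sketch, assuming the model was already pulled as shown above:

```shell
# Send a single non-streaming prompt to the Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

This is the same endpoint that desktop clients and libraries use, so anything that can speak HTTP can now drive the model.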

Deploy AI Models with Docker for Hugging Face

For production environments where you need to deploy Hugging Face transformers as a scalable API, tools like vLLM or Text Generation Inference (TGI) are preferred. These tools are often distributed as Docker images optimized for high throughput.

The following example demonstrates how to serve a Hugging Face model using the vLLM container. Note the use of the HUGGING_FACE_HUB_TOKEN environment variable for gated models.

root@ubuntu:~$ docker run -d --name vllm-server \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --env HUGGING_FACE_HUB_TOKEN="your_token_here" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

Best Practices for AI Container Deployment

Managing AI workloads requires a different approach than traditional web services due to the heavy resource demands. Follow these best practices to ensure stability and performance.

  • Volume Mounting: Always mount a host directory or Docker volume to /root/.cache or /root/.ollama. AI models are massive, and you don’t want to re-download 10GB+ files every time you restart a container.
  • Resource Limits: Use Docker’s --memory and --cpus flags to prevent a single runaway inference process from crashing your entire Linux server.
  • Image Versioning: Never use the :latest tag for production. Explicitly version your base images (e.g., nvidia/cuda:12.6.0-base-ubuntu24.04) to ensure reproducible builds.
  • Monitoring: Keep an eye on VRAM usage. If you run out of VRAM, Docker containers might crash with “Out of Memory” (OOM) errors that are difficult to debug. Consult our guide on AI distributions for tools that help monitor GPU health.
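Several of these practices can be combined into a single run command. This is a sketch only: the memory and CPU limits below are illustrative values to adapt to your hardware, and the pinned tag is a placeholder you should replace with a current release:

```shell
# Pinned image tag, persistent model volume, and resource limits in one command
# (limits and tag are example values, not recommendations)
docker run -d --name ollama \
  --gpus all \
  --memory 16g \
  --cpus 8 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
```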

Conclusion

When you deploy AI models with Docker on Linux, you leverage the full power of containerization to build a robust AI infrastructure. By correctly setting up the NVIDIA Container Toolkit and utilizing optimized images from Ollama or Hugging Face, you can transform a standard Linux server into a high-performance AI inference engine. As you move forward, consider exploring orchestration tools like Docker Compose or Kubernetes to manage multiple model instances and scale your AI capabilities further.
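As a starting point for that next step, the Ollama deployment above can be expressed as a minimal Docker Compose file. This is a sketch under the same assumptions as the run command earlier; adjust the device reservation and volume to your environment:

```yaml
# docker-compose.yml -- minimal sketch of the Ollama service shown above
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
```

With this file in place, `docker compose up -d` brings the service up with GPU access, and the same file becomes the unit you later scale or migrate to an orchestrator.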
