[How To] Hugging Face Linux Deployment: Deploy Transformers
Deploying Hugging Face models on Linux allows developers to serve powerful AI models for real-world applications. By leveraging tools like FastAPI and Docker, you can create a scalable, reproducible environment for serving Hugging Face Transformers models on a Linux server. This guide provides a comprehensive walkthrough of setting up a production-ready inference service from scratch.
This tutorial covers everything from setting up a local Python environment to containerizing the application with Docker and deploying it on a fresh Ubuntu server. Along the way, we will build a simple API endpoint that takes a text prompt and returns a generated sequence from a pre-trained model.
Table of Contents
- Prerequisites
- Step 1: Setting Up the Local Python Environment
- Step 2: Creating the FastAPI Application
- Step 3: Dockerizing the Application
- Step 4: Deploying on a Linux Server
- Conclusion
- Next Steps
Prerequisites for Hugging Face Linux Deployment
Before you begin, ensure you have the following:
- A Linux server running a recent version of Ubuntu (this guide uses Ubuntu 24.04 LTS).
- sudo or root privileges on the server.
- Basic familiarity with the Linux command line and Python programming.
- Docker installed on your server.
- An understanding of what a container is. For more details, see our beginner’s guide to Linux containers.
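The checks below are an optional, quick way to confirm these prerequisites on the server (exact version output will vary by system):

```shell
# Confirm the prerequisites are in place (output will vary by system)
command -v docker && docker --version     # Docker is installed
command -v python3 && python3 --version   # Python 3 is available
. /etc/os-release && echo "$PRETTY_NAME"  # e.g. Ubuntu 24.04 LTS
```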
Step 1: Preparing Your Python Environment for Hugging Face Linux Deployment
To begin, connect to your Linux server and update its package repositories. Starting from an up-to-date system is always good practice.
lc-root@ubuntu:~$ sudo apt update && sudo apt upgrade -y
Next, install Python, pip, and the venv virtual environment module.
lc-root@ubuntu:~$ sudo apt install -y python3-pip python3-venv
Create a project directory for your application and a virtual environment inside it. This isolates your project’s dependencies from the system’s Python packages.
lc-root@ubuntu:~$ mkdir hf_deployment
lc-root@ubuntu:~$ cd hf_deployment
lc-root@ubuntu:~$ python3 -m venv .venv
Next, activate the virtual environment:
lc-root@ubuntu:~$ source .venv/bin/activate
(.venv) lc-root@ubuntu:~$
Your shell prompt should now be prefixed with (.venv), indicating that the virtual environment is active.
Step 2: Creating the FastAPI Application
With the environment ready, install the necessary Python libraries: fastapi for the API, uvicorn as the server, and transformers with torch for the AI model.
(.venv) lc-root@ubuntu:~$ pip install fastapi uvicorn transformers torch
Next, create a file named main.py to house the API logic. This code loads a pre-trained model and exposes an endpoint to handle inference requests.
# main.py
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

# Initialize FastAPI app
app = FastAPI(
    title="Hugging Face Inference API",
    description="An API for text generation using a Hugging Face model.",
    version="1.0"
)

# Load the text-generation pipeline
# distilgpt2 is a smaller, faster version of GPT-2
try:
    generator = pipeline("text-generation", model="distilgpt2")
except Exception as e:
    generator = None
    print(f"Failed to load model: {e}")

# Define request and response models for type validation
class GenerationRequest(BaseModel):
    prompt: str
    max_length: int = 40

class GenerationResponse(BaseModel):
    generated_text: str

@app.get("/")
def read_root():
    return {"status": "API is running. Visit /docs for details."}

@app.post("/generate", response_model=GenerationResponse)
def generate_text(request: GenerationRequest):
    if generator is None:
        return {"generated_text": "Model is not available."}
    result = generator(request.prompt, max_length=request.max_length)
    return {"generated_text": result[0]["generated_text"]}
This script sets up a /generate endpoint that accepts a prompt and returns the model’s output. For more background on creating models, consider reading about how to build your first AI model on Linux.
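As a sketch of the client side, this stdlib-only snippet builds the same POST request that curl will send later; the host and port (127.0.0.1:8000) are assumptions for a local uvicorn run, so adjust them to match your deployment:

```python
import json
import urllib.request

# Build a POST request for the /generate endpoint
# (127.0.0.1:8000 assumes uvicorn running locally; adjust as needed)
url = "http://127.0.0.1:8000/generate"
payload = {"prompt": "The future of AI is", "max_length": 40}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])
print(req.method, req.full_url)
```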
Step 3: Dockerizing the Application
Docker allows you to package your application and its dependencies into a single, portable container. Create a Dockerfile in your project directory:
# Dockerfile
# Use an official Python runtime as a parent image
FROM python:3.10-slim

# Set the working directory in the container
WORKDIR /app

# Copy the dependencies file to the working directory
COPY requirements.txt .

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir --trusted-host pypi.python.org -r requirements.txt

# Copy the application code to the working directory
COPY main.py .

# Expose the port the app runs on
EXPOSE 8000

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
You will also need a requirements.txt file listing the Python dependencies for the Docker build.
(.venv) lc-root@ubuntu:~$ pip freeze > requirements.txt
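Note that pip freeze pins every installed package, including transitive dependencies, which is reproducible but verbose. A minimal alternative is a hand-maintained requirements.txt listing only the direct dependencies:

```
fastapi
uvicorn
transformers
torch
```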
Now build the Docker image. This command tells Docker to build an image from the Dockerfile in the current directory and tag it as hf-api-server.
lc-root@ubuntu:~$ docker build -t hf-api-server .
This process might take some time, as Docker downloads the base image and installs the dependencies. Note that with this Dockerfile the Hugging Face model itself is downloaded the first time the container starts, when the pipeline is initialized, so the first request may be slow.
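One caveat: the build context sent to Docker includes everything in the project directory, including the .venv folder, which can slow the build considerably. A .dockerignore file (a minimal sketch is shown here) keeps the context small:

```
# .dockerignore
.venv/
__pycache__/
*.pyc
```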
Step 4: Completing Your Hugging Face Linux Deployment
Once the image is built, you can run it as a container. The following command starts the container in detached mode and maps port 80 on the host to port 8000 in the container.
lc-root@ubuntu:~$ docker run -d -p 80:8000 --name hf-inference-api hf-api-server
You can verify that the container is running with:
lc-root@ubuntu:~$ docker ps
To test your API, use curl to send a request to the /generate endpoint from your server’s terminal:
lc-root@ubuntu:~$ curl -X POST "http://127.0.0.1/generate" -H "Content-Type: application/json" -d '{"prompt": "The future of AI is"}'
You should receive a JSON response with the generated text, confirming a successful Hugging Face Linux deployment. Other local inference tools are also worth exploring, such as installing Ollama for local AI inference.
Conclusion
You have successfully deployed a Hugging Face Transformers model as a web service on a Linux server. By containerizing the application with Docker, you’ve created a portable and scalable service that can be easily managed and deployed across different environments. This setup provides a solid foundation for building more complex AI-powered applications.
Next Steps
From here, you can explore several enhancements:
- GPU Acceleration: For better performance, deploy on a GPU-enabled server and use a Docker image with CUDA support.
- Scalability: Use a container orchestrator like Kubernetes to manage multiple instances of your API for high availability and load balancing.
- Security: Implement authentication and rate limiting to protect your API from unauthorized access.
- Choosing a Distro: For specialized AI workloads, you may want to evaluate different Linux distributions for AI and Machine Learning.