[How To] Hugging Face Linux Deployment: Deploy Transformers

The Hugging Face Linux deployment process allows developers to serve powerful AI models for real-world applications. By leveraging tools like FastAPI and Docker, you can create a scalable, reproducible environment for deploying Hugging Face Transformer models on a Linux server. This guide provides a comprehensive walkthrough of setting up a production-ready inference service from scratch.

This tutorial covers everything from setting up a local Python environment to containerizing the application with Docker and deploying it on a fresh Ubuntu server. Along the way, we will build a simple API endpoint that takes a text prompt and returns a generated sequence from a pre-trained model.


Prerequisites for Hugging Face Linux Deployment

Before you begin, ensure you have the following:

  • A Linux server running a recent version of Ubuntu (this guide uses Ubuntu 24.04 LTS).
  • sudo or root privileges on the server.
  • Basic familiarity with the Linux command line and Python programming.
  • Docker installed on your server.
  • An understanding of what a container is. For more details, see our beginner’s guide to Linux containers.

Step 1: Preparing Your Python Environment for Hugging Face Linux Deployment

To begin, connect to your Linux server and update its package repositories. Starting from an up-to-date system is always good practice.

lc-root@ubuntu:~$ sudo apt update && sudo apt upgrade -y

Next, install pip and the venv virtual environment module:

lc-root@ubuntu:~$ sudo apt install -y python3-pip python3-venv

Then create a project directory for your application and set up a virtual environment inside it. This isolates your project’s dependencies from the system’s Python packages.

lc-root@ubuntu:~$ mkdir hf_deployment
lc-root@ubuntu:~$ cd hf_deployment
lc-root@ubuntu:~$ python3 -m venv .venv

Next, activate the virtual environment:

lc-root@ubuntu:~$ source .venv/bin/activate
(.venv) lc-root@ubuntu:~$

Your shell prompt should now be prefixed with (.venv), indicating that the virtual environment is active.
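If you ever need to confirm from within Python itself that the environment is active, a quick standard-library check works (a small sketch, not part of the deployment):

```python
import sys

def in_virtualenv():
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the system Python installation.
    return sys.prefix != sys.base_prefix

print("virtualenv active:", in_virtualenv())
```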

Step 2: Creating the FastAPI Application

With the environment ready, install the necessary Python libraries: fastapi for the API, uvicorn as the ASGI server, and transformers with torch for the model.

(.venv) lc-root@ubuntu:~$ pip install fastapi uvicorn transformers torch

Next, create a file named main.py to house the API logic. This code loads a pre-trained model and exposes an endpoint to handle inference requests.

# main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline

# Initialize the FastAPI app
app = FastAPI(
    title="Hugging Face Inference API",
    description="An API for text generation using a Hugging Face model.",
    version="1.0"
)

# Load the text-generation pipeline at startup.
# distilgpt2 is a smaller, faster variant of GPT-2.
try:
    generator = pipeline("text-generation", model="distilgpt2")
except Exception as e:
    generator = None
    print(f"Failed to load model: {e}")

# Request and response models for input validation
class GenerationRequest(BaseModel):
    prompt: str
    max_length: int = 40

class GenerationResponse(BaseModel):
    generated_text: str

@app.get("/")
def read_root():
    return {"status": "API is running. Visit /docs for details."}

@app.post("/generate", response_model=GenerationResponse)
def generate_text(request: GenerationRequest):
    if generator is None:
        # Return a proper HTTP error instead of a 200 with placeholder text
        raise HTTPException(status_code=503, detail="Model is not available.")

    result = generator(request.prompt, max_length=request.max_length)
    return {"generated_text": result[0]["generated_text"]}

This script sets up a /generate endpoint that accepts a prompt and returns the model’s output. For more background on creating models, consider reading about how to build your first AI model on Linux.
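To clarify the data structure the endpoint indexes into: the transformers text-generation pipeline returns a list with one dict per generated sequence. A stub standing in for the real pipeline makes the extraction in generate_text() easy to see in isolation:

```python
# Stub mimicking the output shape of a transformers text-generation
# pipeline: a list containing one dict per generated sequence.
def fake_generator(prompt, max_length=40):
    return [{"generated_text": prompt + " bright."}]

result = fake_generator("The future of AI is")
print(result[0]["generated_text"])  # same indexing as in main.py
```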

Step 3: Dockerizing the Application

Docker allows you to package your application and its dependencies into a single, portable container. Create a Dockerfile in your project directory:

# Dockerfile
# Use an official Python runtime as a parent image
FROM python:3.10-slim

# Set the working directory in the container
WORKDIR /app

# Copy the dependencies file to the working directory
COPY requirements.txt .

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code into the working directory
COPY main.py .

# Expose the port the app runs on
EXPOSE 8000

# Run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

You will also need a requirements.txt file that lists the Python dependencies for the Docker build.

(.venv) lc-root@ubuntu:~$ pip freeze > requirements.txt
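Note that pip freeze pins every package in the active environment, including transitive dependencies. For a leaner, more deliberate build you could instead hand-write the file with just the top-level packages (add version pins as needed; the unpinned form below is only a sketch):

```text
fastapi
uvicorn
transformers
torch
```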

Now build the Docker image. This command tells Docker to build an image from the Dockerfile in the current directory and tag it as hf-api-server.

lc-root@ubuntu:~$ docker build -t hf-api-server .

This process might take some time, as Docker downloads the base image and installs the dependencies. Note that the distilgpt2 model itself is not part of the image: the transformers library downloads it the first time the container starts.
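If you would rather bake the model into the image so containers start without a download, one option is to add a line like the following to the Dockerfile after the dependency install (this pre-populates the Hugging Face cache in an image layer):

```dockerfile
# Optional: pre-download the model at build time so the container
# does not fetch it on first startup (model name from main.py).
RUN python -c "from transformers import pipeline; pipeline('text-generation', model='distilgpt2')"
```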

Step 4: Completing Your Hugging Face Linux Deployment

Once the image is built, you can run it as a container. The following command starts the container in detached mode and maps port 80 on the host to port 8000 in the container.

lc-root@ubuntu:~$ docker run -d -p 80:8000 --name hf-inference-api hf-api-server

You can verify that the container is running with:

lc-root@ubuntu:~$ docker ps

To test your API, use curl to send a request to the /generate endpoint from your server’s terminal:

lc-root@ubuntu:~$ curl -X POST "http://127.0.0.1/generate" -H "Content-Type: application/json" -d '{"prompt": "The future of AI is"}'

You should receive a JSON response with the generated text, confirming a successful Hugging Face Linux deployment. Other local inference tools are also worth exploring, such as installing Ollama for local AI inference.
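The same request can be sent from Python using only the standard library. This sketch assumes the container is running locally on port 80 as configured above:

```python
import json
import urllib.request

def build_generate_request(url, prompt, max_length=40):
    """Build a POST request matching the /generate endpoint's JSON schema."""
    body = json.dumps({"prompt": prompt, "max_length": max_length}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Example (requires the container to be running):
# req = build_generate_request("http://127.0.0.1/generate", "The future of AI is")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])
```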

Conclusion

You have successfully deployed a Hugging Face Transformers model as a web service on a Linux server. By containerizing the application with Docker, you’ve created a portable and scalable service that can be easily managed and deployed across different environments. This setup provides a solid foundation for building more complex AI-powered applications.

Next Steps

From here, you can explore several enhancements:

  • GPU Acceleration: For better performance, deploy on a GPU-enabled server and use a Docker image with CUDA support.
  • Scalability: Use a container orchestrator like Kubernetes to manage multiple instances of your API for high availability and load balancing.
  • Security: Implement authentication and rate limiting to protect your API from unauthorized access.
  • Choosing a Distro: For specialized AI workloads, you may want to evaluate different Linux distributions for AI and Machine Learning.
