The Practitioner's Guide to LLMOps
From Zero to Deploying Stable Diffusion & LLMs on Your Own Infrastructure
A hands-on, code-heavy guide for engineers who want to stop reading theory and start shipping AI models to production.
Why This Blog Exists
Here's the uncomfortable truth about most "deploy your LLM" tutorials: they show you a model.generate() call in a Jupyter notebook and call it deployment. That's like saying you know how to cook because you can microwave ramen.
Real deployment means your model survives real traffic, restarts gracefully, scales under load, and doesn't bankrupt you with GPU costs at 3 AM.
This guide takes you through the full journey — from writing your first model handler to deploying on Kubernetes with monitoring, rate limiting, and cost controls. Every code block is production-tested. Every architecture decision is explained with the why, not just the what.
Let's ship something real.
01 What is LLMOps?
Think of it like running a restaurant. MLOps is the kitchen — you train models, validate them, and serve predictions. LLMOps is the entire restaurant operation: the kitchen, the waitstaff, the reservation system, the supply chain, the health inspections, and the accountant making sure you don't go bankrupt buying wagyu beef for every dish.
LLMs are fundamentally different from traditional ML models. They're massive, expensive, non-deterministic, and they talk back. That changes everything about how you operate them.
MLOps vs LLMOps
| Dimension | Traditional MLOps | LLMOps |
|---|---|---|
| Model Size | MBs to low GBs | GBs to hundreds of GBs |
| Training | Train from scratch, retrain often | Fine-tune or prompt-engineer a foundation model |
| Inference Cost | Cheap (CPU often sufficient) | Expensive (GPU required, cost per token) |
| Evaluation | Accuracy, F1, AUC — well-defined metrics | Vibes-based + human eval + LLM-as-judge |
| Versioning | Model weights + data | Prompts + model version + RAG corpus + guardrails |
| Failure Modes | Wrong prediction | Hallucination, prompt injection, toxic output, cost explosion |
| Scaling | Horizontal, relatively simple | GPU-bound, requires batching, quantization, caching |
The LLMOps Lifecycle
┌───────────┐     ┌───────────┐     ┌───────────┐     ┌───────────┐
│  SELECT   │  →  │   SERVE   │  →  │  MONITOR  │  →  │ OPTIMIZE  │
│  MODEL    │     │   MODEL   │     │  & EVAL   │     │  & SCALE  │
└───────────┘     └───────────┘     └───────────┘     └───────────┘
  Fine-tune        FastAPI /         Latency,          Quantize,
  or use           vLLM /            Cost per          Batch,
  off-shelf        LitServe          request           Cache
The 2025 Reality Check
The LLMOps landscape has matured significantly. Here's what actually matters now:
- Context Engineering > Prompt Engineering. It's no longer just about clever prompts. It's about building systems that feed the right context — RAG pipelines, tool outputs, memory — into the model at the right time. The prompt is the last mile, not the whole journey.
- Guardrails are non-negotiable. Every production LLM deployment needs input validation, output filtering, and cost circuit breakers. Not as an afterthought — as core infrastructure. One prompt injection in production and your CEO is on the phone.
- Software Engineering > ML Engineering. The bottleneck in LLMOps isn't model performance — it's building reliable systems around unreliable models. Retry logic, graceful degradation, queue management, observability. The boring stuff is what keeps you in production.
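To make "the boring stuff" concrete, here's a minimal sketch of retry-with-backoff plus graceful degradation. The helper name `with_retries` and its parameters are mine, not from any particular library:

```python
import random
import time


def with_retries(fn, max_attempts=3, base_delay=0.5, fallback=None):
    """Call fn, retrying transient failures with exponential backoff.

    On final failure, return `fallback` instead of raising: for many
    LLM features a degraded answer beats a 500 error.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                return fallback
            # 0.5s, 1s, 2s... plus jitter so a fleet of clients
            # doesn't retry in lockstep (thundering herd).
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# A call that fails twice, then succeeds:
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("model overloaded")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

The same pattern wraps any upstream call: a hosted LLM API, a vector database, a GPU worker that occasionally OOMs.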
02 Setting Up Your Toolkit
Before we deploy anything, let's set up a clean environment with everything we'll need.
# Create project directory
mkdir llmops-stack && cd llmops-stack
# Python environment (using uv for speed)
uv init
uv add fastapi uvicorn pillow torch diffusers transformers
uv add litserve aiohttp pydantic python-multipart
uv add prometheus-client structlog
# Docker (make sure you have Docker Desktop with GPU support)
docker --version # Need 24.0+
nvidia-smi # Verify GPU access
# Kubernetes tools (for later)
# brew install kubectl helm k9s
Here's the project structure we'll build throughout this guide:
llmops-stack/
├── models/
│ ├── sd_handler.py # Stable Diffusion model handler
│ └── llm_handler.py # LLM model handler
├── api/
│ ├── sd_server.py # Stable Diffusion FastAPI server
│ └── llm_server.py # LLM LitServe server
├── middleware/
│ ├── rate_limiter.py # Rate limiting
│ ├── queue_handler.py # Request queue
│ └── metrics.py # Prometheus metrics
├── docker/
│ ├── Dockerfile.sd # SD container
│ ├── Dockerfile.llm # LLM container
│ └── docker-compose.yml # Full stack orchestration
├── k8s/
│ └── sd-deployment.yaml # Kubernetes manifests
└── tests/
├── test_sd_api.py
└── test_llm_api.py
03 Deploy Stable Diffusion as an API
Let's start with something visual and satisfying — turning Stable Diffusion into a production API. This section covers the full journey: model handler, API server, containerization, and orchestration.
Step 1: The Model Handler
The model handler is responsible for loading the model into GPU memory and running inference. Every decision here has a production reason behind it.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image
import io
import base64
import logging
from typing import Optional
logger = logging.getLogger(__name__)
class StableDiffusionHandler:
"""Production-ready Stable Diffusion model handler.
Design decisions:
- Singleton pattern: Loading a 4GB model per request would be insane.
We load once at startup and reuse across all requests.
- float16 precision: Cuts VRAM usage in half with negligible quality loss.
A100 users can try bfloat16 for slightly better quality.
- Safety checker disabled: For internal/controlled deployments.
Re-enable for public-facing APIs (legal liability is real).
- attention_slicing: Trades ~10% speed for ~30% less VRAM.
Essential on consumer GPUs (RTX 3090, 4090).
"""
def __init__(
self,
model_id: str = "stabilityai/stable-diffusion-2-1",
device: str = "cuda",
dtype: torch.dtype = torch.float16,
):
self.model_id = model_id
self.device = device
self.dtype = dtype
self.pipeline: Optional[StableDiffusionPipeline] = None
def load_model(self) -> None:
"""Load the model into GPU memory.
This is called ONCE at server startup, not per request.
On a T4 GPU, this takes ~15 seconds. On an A100, ~5 seconds.
"""
logger.info(f"Loading model: {self.model_id}")
logger.info(f"Device: {self.device}, Dtype: {self.dtype}")
self.pipeline = StableDiffusionPipeline.from_pretrained(
self.model_id,
torch_dtype=self.dtype,
# safety_checker=None, # Uncomment for internal use only
)
self.pipeline = self.pipeline.to(self.device)
# Memory optimization: essential on GPUs with <24GB VRAM
self.pipeline.enable_attention_slicing()
# Optional: even more VRAM savings at cost of speed
# self.pipeline.enable_vae_slicing()
# self.pipeline.enable_model_cpu_offload()
logger.info("Model loaded successfully")
def generate(
self,
prompt: str,
negative_prompt: str = "blurry, low quality, distorted",
num_inference_steps: int = 30,
guidance_scale: float = 7.5,
width: int = 512,
height: int = 512,
) -> str:
"""Generate an image and return it as a base64 string.
Why base64? Three reasons:
1. No file system dependency -- works in containers with read-only fs
2. Easy to embed in JSON API responses
3. Avoids dealing with file cleanup and temp directories
For production with high volume, consider streaming to S3/GCS instead.
"""
if self.pipeline is None:
raise RuntimeError("Model not loaded. Call load_model() first.")
# Generate the image
with torch.no_grad(): # Saves ~20% VRAM during inference
result = self.pipeline(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
width=width,
height=height,
)
image: Image.Image = result.images[0]
# Convert to base64 for API response
buffer = io.BytesIO()
image.save(buffer, format="PNG")
img_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
return img_base64
def health_check(self) -> dict:
"""Verify the model is loaded and GPU is accessible."""
return {
"status": "healthy" if self.pipeline else "not_loaded",
"model": self.model_id,
"device": self.device,
"gpu_memory_allocated": f"{torch.cuda.memory_allocated() / 1e9:.2f} GB"
if torch.cuda.is_available() else "N/A",
}
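The base64 round-trip is worth seeing end to end. Here's a stdlib-only sketch of what the server encodes and what a client decodes (the bytes below are a stand-in for a real PNG buffer, not actual image data):

```python
import base64

# Server side: PNG bytes from the BytesIO buffer -> text for JSON
raw_bytes = b"\x89PNG\r\n\x1a\n fake image payload"  # stand-in, not a real PNG
img_base64 = base64.b64encode(raw_bytes).decode("utf-8")

# Client side: JSON field -> bytes -> writeable to disk as a .png
decoded = base64.b64decode(img_base64)
assert decoded == raw_bytes  # lossless round-trip

print(len(raw_bytes), len(img_base64))  # base64 is ~33% larger on the wire
```

That ~33% overhead is the cost of JSON-embeddable images; at high volume, uploading to object storage and returning a URL avoids it, as noted in the docstring above.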
Step 2: The FastAPI Server
Now let's wrap our handler in a proper API server with health checks, request validation, and structured logging.
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import time
import logging
import torch  # needed for the GPU cleanup in lifespan()
from sd_handler import StableDiffusionHandler
logger = logging.getLogger(__name__)
# --- Model lifecycle management ---
# Why lifespan? Because we need the model loaded BEFORE any request
# hits the server. The old @app.on_event("startup") is deprecated.
handler = StableDiffusionHandler()
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Load model at startup, clean up at shutdown."""
handler.load_model() # ~15 seconds on T4, blocks until ready
yield
# Cleanup: free GPU memory
del handler.pipeline
if torch.cuda.is_available():
torch.cuda.empty_cache()
app = FastAPI(
title="Stable Diffusion API",
version="1.0.0",
lifespan=lifespan,
)
# --- Request/Response Models ---
# Pydantic handles validation so we don't have to.
# A 1024x1024 image with 50 steps will OOM on a T4 -- so we set limits.
class GenerateRequest(BaseModel):
prompt: str = Field(..., min_length=1, max_length=500)
negative_prompt: str = Field(
default="blurry, low quality, distorted",
max_length=500,
)
num_inference_steps: int = Field(default=30, ge=1, le=50)
guidance_scale: float = Field(default=7.5, ge=1.0, le=20.0)
width: int = Field(default=512, ge=256, le=768)
height: int = Field(default=512, ge=256, le=768)
class GenerateResponse(BaseModel):
image_base64: str
generation_time: float
parameters: dict
# --- Endpoints ---
@app.get("/health")
async def health():
"""Health check -- used by Docker, K8s, and load balancers.
Why a dedicated endpoint? Because GET / returning 200 doesn't
mean the model is loaded. This actually checks GPU state.
"""
return handler.health_check()
@app.post("/generate", response_model=GenerateResponse)
def generate(request: GenerateRequest):
    """Generate an image from a text prompt.

    Why plain def and not async def? The handler call is blocking and
    GPU-bound. FastAPI runs sync endpoints in a thread pool, so the
    event loop stays free to answer /health while the GPU works. An
    async def here would run on the event loop itself and block every
    other request for the full 2-30 second generation.
    """
start_time = time.time()
try:
image_b64 = handler.generate(
prompt=request.prompt,
negative_prompt=request.negative_prompt,
num_inference_steps=request.num_inference_steps,
guidance_scale=request.guidance_scale,
width=request.width,
height=request.height,
)
except RuntimeError as e:
logger.error(f"Generation failed: {e}")
raise HTTPException(status_code=503, detail="Model not ready")
except Exception as e:
logger.error(f"Unexpected error: {e}")
raise HTTPException(status_code=500, detail=str(e))
generation_time = time.time() - start_time
return GenerateResponse(
image_base64=image_b64,
generation_time=round(generation_time, 2),
parameters=request.model_dump(),
)
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Step 3: Dockerize It
A model that only runs on your laptop isn't deployed. Let's containerize it with a production-optimized Dockerfile.
# Why this base image? It includes CUDA runtime, cuDNN, and Ubuntu 22.04.
# The "runtime" variant is ~3GB smaller than "devel" -- we don't need
# the CUDA compiler for inference.
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# Prevent interactive prompts during apt-get install
ENV DEBIAN_FRONTEND=noninteractive
# Install Python 3.11 -- not 3.12, because torch wheels
# for CUDA 12.1 are most stable on 3.11 as of early 2026.
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-venv \
    python3-pip \
    curl \
    && rm -rf /var/lib/apt/lists/*
# curl is needed for the Docker Compose healthcheck -- the slim CUDA
# runtime image doesn't ship with it.
WORKDIR /app
# --- Layer caching strategy ---
# Copy requirements FIRST. Docker caches layers, so if only your
# code changes (not dependencies), this layer is reused.
# This saves 10-15 minutes on rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Now copy the application code (changes more often)
COPY . .
# Pre-download the model at build time.
# Why? Because downloading 4GB at container start means 15 minutes
# before your first request. With this, the model is baked into the image.
# Tradeoff: larger image (~8GB), but instant startup.
RUN python3 -c "from diffusers import StableDiffusionPipeline; \
StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-1')"
EXPOSE 8000
# Use uvicorn with 1 worker. Why just 1?
# Because each worker loads the full model into VRAM.
# 1 model = ~4GB VRAM. 2 workers = OOM on most GPUs.
# Scale horizontally (more containers) instead of vertically (more workers).
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
Step 4: Docker Compose
For local development and testing, Docker Compose orchestrates everything with one command.
version: "3.8"
services:
stable-diffusion:
build:
context: .
dockerfile: Dockerfile.sd
ports:
- "8000:8000"
volumes:
# Cache models on host so rebuilds don't re-download 4GB
- model-cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s # Give the model time to load
restart: unless-stopped
environment:
- CUDA_VISIBLE_DEVICES=0
- TRANSFORMERS_CACHE=/root/.cache/huggingface
volumes:
model-cache:
Build and Run
# Build the image (first time takes ~20 minutes due to model download)
docker compose build
# Start the service
docker compose up -d
# Watch the logs (wait for "Model loaded successfully")
docker compose logs -f stable-diffusion
# Test it!
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "a photo of an astronaut riding a horse on mars, cinematic lighting"}'
# Check GPU usage
nvidia-smi
Notice in the Dockerfile that COPY requirements.txt and COPY . . are on separate lines. That split is what makes the layer cache work: code-only changes reuse the dependency layer and skip the long pip install.
04 Deploy an LLM Chat API
Image generation is cool, but most production workloads are text. Let's deploy an LLM with an OpenAI-compatible API using LitServe.
Why LitServe?
You could use vLLM, TGI, or raw FastAPI. Here's why LitServe hits a sweet spot for learning:
- Batteries included — batching, streaming, GPU management out of the box
- OpenAI-compatible — your existing client code just works
- Pythonic — no YAML configs, no custom serving formats, just a Python class
- Production-tested — used by Lightning AI in production at scale
import litserve as ls
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
class LLMEngine:
"""Handles model loading and raw inference.
Separated from the serving layer so you can:
- Test the engine independently
- Swap models without changing the API
- Reuse across different serving frameworks
"""
def __init__(self, model_id: str, device: str = "cuda"):
self.model_id = model_id
self.device = device
self.model = None
self.tokenizer = None
def load(self):
"""Load model and tokenizer into GPU memory."""
self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
self.model = AutoModelForCausalLM.from_pretrained(
self.model_id,
torch_dtype=torch.float16,
device_map="auto", # Automatically handles multi-GPU sharding
)
self.model.eval() # Disable dropout -- we're doing inference, not training
def generate(self, prompt: str, max_tokens: int = 512, temperature: float = 0.7) -> str:
        # With device_map="auto", the first layers may not be on
        # self.device -- move inputs to wherever the model starts.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=temperature,
do_sample=True,
top_p=0.9,
pad_token_id=self.tokenizer.eos_token_id,
)
# Decode only the NEW tokens, not the input prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return self.tokenizer.decode(new_tokens, skip_special_tokens=True)
class LLMServingAPI(ls.LitAPI):
"""LitServe API wrapper.
LitServe calls these methods in order:
1. setup() -- called once at startup (load model)
2. decode_request() -- parse incoming request
3. predict() -- run inference
4. encode_response() -- format response
This structure forces clean separation of concerns.
"""
def setup(self, device: str):
"""Called once when the server starts."""
self.engine = LLMEngine(
model_id="microsoft/phi-2", # Small enough for a T4, smart enough to be useful
device=device,
)
self.engine.load()
def decode_request(self, request: dict) -> dict:
"""Parse and validate the incoming request."""
return {
"prompt": request["messages"][-1]["content"],
"max_tokens": request.get("max_tokens", 512),
"temperature": request.get("temperature", 0.7),
}
def predict(self, inputs: dict) -> str:
"""Run inference -- this is where the GPU does its thing."""
return self.engine.generate(
prompt=inputs["prompt"],
max_tokens=inputs["max_tokens"],
temperature=inputs["temperature"],
)
def encode_response(self, output: str) -> dict:
"""Format response as OpenAI-compatible JSON."""
return {
"choices": [{
"message": {
"role": "assistant",
"content": output,
}
}]
}
if __name__ == "__main__":
api = LLMServingAPI()
server = ls.LitServer(
api,
accelerator="gpu",
devices=1,
timeout=60, # Kill requests that take >60s
max_batch_size=4, # Batch up to 4 requests together
batch_timeout=0.05, # Wait max 50ms to fill a batch
)
server.run(port=8001)
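One subtle line in LLMEngine.generate is the slice that drops the prompt tokens: Hugging Face's generate() returns the input ids followed by the new ones. With plain lists, the idea looks like this (the token ids are made up for illustration):

```python
# generate() echoes the prompt ids before the generated ones:
input_ids = [101, 2023, 2003]                   # prompt tokens
output_ids = [101, 2023, 2003, 7592, 999, 102]  # prompt + generated

# outputs[0][inputs["input_ids"].shape[1]:] is exactly this slice:
new_tokens = output_ids[len(input_ids):]
print(new_tokens)  # [7592, 999, 102]
```

Forget the slice and your API returns the user's own prompt glued to the front of every completion.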
Client Code
LitServe can expose an OpenAI-compatible endpoint (pass spec=ls.OpenAISpec() to LitServer in recent versions), which means you can use the official OpenAI Python client:
from openai import OpenAI
# Point to your local LLM server instead of OpenAI
client = OpenAI(
base_url="http://localhost:8001/v1",
api_key="not-needed", # LitServe doesn't require auth by default
)
response = client.chat.completions.create(
model="phi-2", # This is ignored by LitServe but required by the client
messages=[
{"role": "user", "content": "Explain Docker in 3 sentences."}
],
max_tokens=256,
temperature=0.7,
)
print(response.choices[0].message.content)
Swapping between your local model and a hosted one is a one-line change (the base_url). This is crucial for hybrid deployments where you route cheap queries to your local model and complex ones to GPT-4.
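A hybrid router can start out embarrassingly simple. This length-based heuristic is purely illustrative (production routers often use a cheap classifier or keyword rules), and both URLs are the ones assumed elsewhere in this guide:

```python
LOCAL_URL = "http://localhost:8001/v1"     # our LitServe deployment
HOSTED_URL = "https://api.openai.com/v1"   # frontier-model fallback

def pick_base_url(prompt: str, max_local_words: int = 50) -> str:
    """Route short prompts to the cheap local model and long or
    complex ones to the hosted frontier model."""
    if len(prompt.split()) <= max_local_words:
        return LOCAL_URL
    return HOSTED_URL

print(pick_base_url("Explain Docker in 3 sentences."))  # local URL
```

Because both backends speak the same protocol, the rest of your client code doesn't change at all: construct the OpenAI client with whichever base_url the router returns.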
LLM Dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
python3.11 python3.11-venv python3-pip \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements-llm.txt .
RUN pip install --no-cache-dir -r requirements-llm.txt
COPY . .
# Pre-download model weights at build time
RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
AutoTokenizer.from_pretrained('microsoft/phi-2'); \
AutoModelForCausalLM.from_pretrained('microsoft/phi-2')"
EXPOSE 8001
CMD ["python3", "api_server.py"]
05 Production Hardening
You have a working API. Now let's make it survive the real world. Production means handling abusive clients, managing request queues when the GPU is saturated, and knowing when things go wrong before your users tell you.
Rate Limiting
Without rate limiting, one client with a while True loop will DoS your GPU and nobody else gets served. This middleware uses a simple sliding window counter per client IP.
import time
from collections import defaultdict
from fastapi import Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware

class RateLimitMiddleware(BaseHTTPMiddleware):
    """Sliding window rate limiter per client IP.

    Why not use a library? Because understanding rate limiting is
    essential for LLMOps. GPU inference is SLOW (2-30 seconds per
    request) -- you can't handle 1000 req/s like a REST API.

    For production at scale, use Redis-backed rate limiting.
    This in-memory version works for single-instance deployments.
    """

    def __init__(self, app, requests_per_minute: int = 10):
        super().__init__(app)
        self.requests_per_minute = requests_per_minute
        self.clients: dict[str, list[float]] = defaultdict(list)

    async def dispatch(self, request: Request, call_next):
        # Get client IP (handle proxies)
        client_ip = request.client.host
        forwarded = request.headers.get("X-Forwarded-For")
        if forwarded:
            client_ip = forwarded.split(",")[0].strip()

        now = time.time()
        window_start = now - 60  # 1-minute sliding window

        # Remove timestamps outside the window
        self.clients[client_ip] = [
            t for t in self.clients[client_ip]
            if t > window_start
        ]

        # Over the limit? Return the 429 directly. Raising HTTPException
        # from inside BaseHTTPMiddleware bypasses FastAPI's exception
        # handlers and surfaces as a 500, so we build the response here.
        if len(self.clients[client_ip]) >= self.requests_per_minute:
            return JSONResponse(
                status_code=429,
                content={"detail": "Rate limit exceeded. Try again in 60 seconds."},
            )

        # Record this request
        self.clients[client_ip].append(now)
        return await call_next(request)
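The window-pruning logic at the heart of the middleware is easy to verify in isolation. A self-contained sketch (function name is mine):

```python
def prune_and_check(timestamps, now, limit, window=60.0):
    """Drop timestamps older than `window` seconds, then report
    whether one more request would stay within `limit`.
    Mirrors the core of the dispatch() method above."""
    kept = [t for t in timestamps if t > now - window]
    return kept, len(kept) < limit

# Three requests in the last minute with a limit of 3: blocked.
kept, allowed = prune_and_check([100.0, 150.0, 155.0], now=158.0, limit=3)
print(kept, allowed)  # [100.0, 150.0, 155.0] False

# A minute later the window has emptied out: allowed again.
kept, allowed = prune_and_check([100.0, 150.0, 155.0], now=220.0, limit=3)
print(kept, allowed)  # [] True
```

Pruning on every request keeps memory bounded per client without needing a background sweeper.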
Request Queue
GPU inference is inherently serial (one model, one GPU, one request at a time). When requests arrive faster than the GPU can process them, you need a queue. Without one, requests pile up, timeouts cascade, and everything falls over.
import asyncio
import time
from dataclasses import dataclass, field
from typing import Any
import logging
logger = logging.getLogger(__name__)
@dataclass
class InferenceRequest:
"""A single queued inference request."""
request_id: str
payload: dict
future: asyncio.Future
enqueued_at: float = field(default_factory=time.time)
class InferenceQueue:
"""Async queue for GPU inference requests.
Why a custom queue instead of just asyncio.Queue?
- Position tracking: clients can poll "where am I in line?"
- Timeout handling: stale requests get dropped, not processed
- Backpressure: reject new requests when the queue is full
instead of accepting them and timing out later (which wastes
everyone's time)
"""
def __init__(self, max_size: int = 50, request_timeout: float = 120.0):
self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_size)
self.request_timeout = request_timeout
self.active_requests: dict[str, InferenceRequest] = {}
async def enqueue(self, request_id: str, payload: dict) -> Any:
"""Add a request to the queue and wait for the result."""
if self.queue.full():
raise RuntimeError("Queue is full. Try again later.")
        # get_running_loop(): get_event_loop() is deprecated inside
        # coroutines on modern Python.
        future = asyncio.get_running_loop().create_future()
req = InferenceRequest(
request_id=request_id,
payload=payload,
future=future,
)
self.active_requests[request_id] = req
await self.queue.put(req)
logger.info(f"Enqueued {request_id}, queue size: {self.queue.qsize()}")
try:
result = await asyncio.wait_for(future, timeout=self.request_timeout)
return result
except asyncio.TimeoutError:
logger.warning(f"Request {request_id} timed out")
raise RuntimeError("Request timed out")
finally:
self.active_requests.pop(request_id, None)
async def process_loop(self, handler_func):
"""Worker loop that processes requests one at a time.
Why sequential? Because the GPU can only run one inference
at a time (unless you're doing continuous batching, which
is a whole other level of complexity).
"""
while True:
req = await self.queue.get()
# Skip if the client already disconnected
if req.future.done():
continue
try:
result = await handler_func(req.payload)
req.future.set_result(result)
except Exception as e:
req.future.set_exception(e)
    def get_position(self, request_id: str) -> int:
        """Best-effort position of a request in the queue (1 = next up).

        asyncio.Queue has no public way to inspect pending items, so we
        peek at its internal deque -- acceptable for a progress hint.
        """
        for i, req in enumerate(self.queue._queue):
            if req.request_id == request_id:
                return i + 1
        return 0  # not queued: already processing, finished, or unknown
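The future-based handoff in enqueue() can feel abstract, so here is a self-contained miniature of the same pattern using a plain asyncio.Queue and a fake inference step (everything below is a stand-in, not the real InferenceQueue):

```python
import asyncio

async def demo() -> str:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)

    async def worker():
        # Sequential loop, exactly like InferenceQueue.process_loop
        while True:
            payload, fut = await queue.get()
            await asyncio.sleep(0.01)        # stand-in for GPU inference
            fut.set_result(f"done:{payload}")

    task = asyncio.create_task(worker())

    # enqueue(): create a future, put it in line, await the result
    fut = asyncio.get_running_loop().create_future()
    await queue.put(("img-1", fut))
    result = await asyncio.wait_for(fut, timeout=5)

    task.cancel()
    return result

print(asyncio.run(demo()))  # done:img-1
```

The future is the bridge: the HTTP handler awaits it, the worker resolves it, and neither ever calls the other directly.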
Monitoring
If you can't measure it, you can't improve it. Prometheus metrics give you visibility into exactly what your model server is doing.
from prometheus_client import (
Counter, Histogram, Gauge, CollectorRegistry, generate_latest
)
import time
import functools
class ServerMetrics:
"""Prometheus metrics for model serving.
These four metrics tell you everything:
- request_count: Are people using this? Is traffic growing?
- request_latency: How fast is inference? Are we degrading?
- gpu_utilization: Are we wasting money on idle GPUs?
- queue_size: Are requests backing up?
"""
def __init__(self):
self.registry = CollectorRegistry()
self.request_count = Counter(
"inference_requests_total",
"Total number of inference requests",
["model", "status"],
registry=self.registry,
)
self.request_latency = Histogram(
"inference_latency_seconds",
"Inference latency in seconds",
["model"],
# Custom buckets tuned for GPU inference (not web requests)
buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0],
registry=self.registry,
)
self.gpu_utilization = Gauge(
"gpu_utilization_percent",
"GPU utilization percentage",
registry=self.registry,
)
self.queue_size = Gauge(
"inference_queue_size",
"Current number of requests in the queue",
registry=self.registry,
)
def track_request(self, model_name: str):
"""Decorator to track request count and latency."""
def decorator(func):
@functools.wraps(func)
async def wrapper(*args, **kwargs):
start = time.time()
try:
result = await func(*args, **kwargs)
self.request_count.labels(
model=model_name, status="success"
).inc()
return result
except Exception as e:
self.request_count.labels(
model=model_name, status="error"
).inc()
raise
finally:
self.request_latency.labels(
model=model_name
).observe(time.time() - start)
return wrapper
return decorator
def get_metrics(self) -> bytes:
"""Export metrics in Prometheus format."""
return generate_latest(self.registry)
Prometheus scrapes the /metrics endpoint every 15s, Grafana visualizes it, and Alertmanager pages you at 3 AM when latency spikes. Set up alerts for: p99 latency > 30s, error rate > 5%, queue size > 20, and GPU utilization < 10% (you're paying for an idle GPU).
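To read those latency buckets correctly, remember that Prometheus histogram buckets are cumulative: one observation increments every bucket whose upper bound it fits under. A stdlib sketch of the bookkeeping (not the client library's actual internals):

```python
def observe(buckets, counts, value):
    """Prometheus-style cumulative buckets: `value` increments every
    bucket whose upper bound (`le`) is >= value."""
    for i, le in enumerate(buckets):
        if value <= le:
            counts[i] += 1
    return counts

buckets = [0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]  # same as ServerMetrics
counts = [0] * len(buckets)
for latency in [0.8, 3.2, 12.0]:                   # three fake requests
    observe(buckets, counts, latency)

print(counts)  # [0, 1, 1, 2, 2, 3, 3] -- e.g. 2 of 3 requests took <= 5s
```

This is why quantiles like p99 are computed from bucket boundaries at query time (histogram_quantile in PromQL) rather than stored exactly.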
06 Deploying to Kubernetes
Docker Compose works on a single machine. When you need multiple replicas, rolling updates, and automatic recovery, it's time for Kubernetes. Here's a production-ready deployment manifest with every decision explained.
apiVersion: apps/v1
kind: Deployment
metadata:
name: stable-diffusion
labels:
app: stable-diffusion
spec:
replicas: 2 # 2 GPUs = 2 replicas. Scale by adding GPUs.
selector:
matchLabels:
app: stable-diffusion
template:
metadata:
labels:
app: stable-diffusion
spec:
containers:
- name: sd-api
image: your-registry/sd-api:latest
ports:
- containerPort: 8000
# --- Probes ---
# readinessProbe: "Is this pod ready to receive traffic?"
# The model takes ~15s to load. Without this probe, K8s would
# send traffic to a pod that hasn't loaded the model yet = 503 errors.
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30 # Give the model time to load
periodSeconds: 10
failureThreshold: 3
# livenessProbe: "Is this pod still alive?"
# If inference hangs (GPU deadlock, OOM), K8s will restart the pod.
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 30
failureThreshold: 3
resources:
requests:
memory: "8Gi"
cpu: "2"
nvidia.com/gpu: "1" # Request exactly 1 GPU
limits:
memory: "16Gi"
cpu: "4"
nvidia.com/gpu: "1" # Limit to 1 GPU (don't be greedy)
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
env:
- name: CUDA_VISIBLE_DEVICES
value: "0"
# --- Model Caching ---
# PersistentVolumeClaim so models survive pod restarts.
# Without this, every pod restart re-downloads 4GB. With it, instant startup.
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
# Schedule only on GPU nodes
nodeSelector:
gpu: "true"
# Tolerate the GPU node taint
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
# PVC for model caching
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi # Enough for SD + LLM model weights
---
# Service to expose the deployment
apiVersion: v1
kind: Service
metadata:
name: stable-diffusion-service
spec:
selector:
app: stable-diffusion
ports:
- port: 80
targetPort: 8000
type: ClusterIP
07 The Learning Roadmap
LLMOps is a deep field. Here's a structured 8-week path from "I have Docker installed" to "I run production AI infrastructure."
Phase 1: Foundations (Weeks 1-2)
Goal: Get comfortable with the tools.
- Docker fundamentals: build, run, volume mounts, multi-stage builds
- GPU in Docker: install nvidia-container-toolkit, run nvidia-smi inside a container
- FastAPI: build a simple REST API, understand async/await, add Pydantic validation
- Milestone: Run a PyTorch model inside a Docker container and hit it with curl
Phase 2: Model Serving (Weeks 3-4)
Goal: Deploy real models that people can use.
- Deploy an image classifier (ResNet, easy win)
- Deploy Stable Diffusion (follow Section 3 of this guide)
- Deploy an LLM with LitServe (follow Section 4)
- Benchmark: measure latency, throughput, VRAM usage
- Milestone: Two working APIs (image gen + text gen) with health checks
Phase 3: Production Engineering (Weeks 5-6)
Goal: Make it survive real-world conditions.
- Add monitoring (Prometheus + Grafana)
- Set up CI/CD (GitHub Actions → build image → deploy)
- Kubernetes basics: deployments, services, ingress
- Load testing: use Locust to find your breaking point
- Milestone: A fully monitored deployment with automated rollouts
Phase 4: Advanced LLMOps (Weeks 7-8)
Goal: Operate AI systems at scale.
- Model versioning: A/B test different models behind the same API
- Prompt versioning: track prompt changes like code changes
- Cost optimization: quantization, caching, spot instances
- Guardrails: input validation, output filtering, circuit breakers
- Milestone: A/B testing two models with cost tracking and guardrails
Phase 5: Capstone Project
Goal: Build something end-to-end that demonstrates your skills.
08 Cost Optimization
GPU compute is expensive. A single A100 on AWS costs ~$3/hour. Running it 24/7 for a month is $2,160. Here's how to cut that by 60-80% without sacrificing quality.
| Strategy | Impact | Complexity |
|---|---|---|
| Model Quantization (INT8/INT4) | 40-60% VRAM reduction, use smaller/cheaper GPUs | Low — tools like bitsandbytes make it one-line |
| Response Caching (Redis) | 30-70% fewer inference calls for repeated queries | Low — hash the prompt, cache the response |
| Spot/Preemptible Instances | 60-80% cheaper than on-demand GPU instances | Medium — need graceful handling of interruptions |
| Auto-scaling to Zero | 90%+ savings during off-peak hours | Medium — cold start latency is the tradeoff |
| Model Distillation | Use a smaller model that mimics the large one | High — requires training data and evaluation |
| Request Batching | 2-4x throughput improvement on the same GPU | Medium — need to handle variable latency |
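Response caching from the table above, sketched with stdlib hashing and an in-memory dict. Everything here is illustrative (swap the dict for Redis in production; the function names are mine):

```python
import hashlib
import json

cache: dict = {}  # stand-in for Redis

def cache_key(prompt: str, params: dict) -> str:
    """Deterministic key: identical prompt + params -> identical key.
    sort_keys matters so {"a": 1, "b": 2} and {"b": 2, "a": 1} match."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(prompt, params, generate_fn):
    key = cache_key(prompt, params)
    if key not in cache:                 # miss: pay for one GPU inference
        cache[key] = generate_fn(prompt, params)
    return cache[key]                    # hit: free

calls = []
def fake_generate(prompt, params):
    calls.append(prompt)
    return f"image-for:{prompt}"

cached_generate("a cat", {"steps": 30}, fake_generate)
cached_generate("a cat", {"steps": 30}, fake_generate)  # served from cache
print(len(calls))  # 1 -- the second call never touched the "GPU"
```

One caveat: exact-match caching only helps when users repeat prompts verbatim. For LLM chat, semantic caching (embedding similarity) catches more hits but adds its own failure modes.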
Quick Reference
Essential commands you'll use daily when operating model servers:
# Monitor GPU usage in real-time
watch -n 1 nvidia-smi
# Check Docker container GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Build and push to registry
docker build -t your-registry/sd-api:v1.2 -f Dockerfile.sd .
docker push your-registry/sd-api:v1.2
# K8s: rolling update with zero downtime
kubectl set image deployment/stable-diffusion sd-api=your-registry/sd-api:v1.2
# K8s: check pod status and GPU allocation
kubectl get pods -l app=stable-diffusion -o wide
kubectl describe node | grep -A 5 "nvidia.com/gpu"
# Load test with curl (quick and dirty)
for i in $(seq 1 10); do
curl -s -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "a test image"}' &
done
wait
# Check Prometheus metrics
curl -s http://localhost:8000/metrics | grep inference_latency
Wrapping Up
We've covered a lot of ground: from a bare Python model handler to a production Kubernetes deployment with monitoring, rate limiting, and cost controls. But here's what I want you to take away:
LLMOps is not about memorizing YAML files. It's about understanding the why behind every decision. Why do we use a lifespan handler? Because the model must be loaded before serving traffic. Why rate limiting? Because GPU inference is slow and expensive. Why Prometheus? Because you can't optimize what you can't measure.
Every code block in this guide has comments explaining the reasoning. When you understand the reasoning, you can adapt to any tool, any cloud provider, any model.
Start with Section 3. Get a Stable Diffusion API running locally. Then work your way through the rest. The roadmap in Section 7 will keep you on track.
LLMOps = Model Serving + Infrastructure + Monitoring + Cost Control + Guardrails