LLMOps Feb 2026 35 min read

The Practitioner's Guide to LLMOps

From Zero to Deploying Stable Diffusion & LLMs on Your Own Infrastructure

A hands-on, code-heavy guide for engineers who want to stop reading theory and start shipping AI models to production.

Why This Blog Exists

Here's the uncomfortable truth about most "deploy your LLM" tutorials: they show you a model.generate() call in a Jupyter notebook and call it deployment. That's like saying you know how to cook because you can microwave ramen.

Real deployment means your model survives real traffic, restarts gracefully, scales under load, and doesn't bankrupt you with GPU costs at 3 AM.

This guide takes you through the full journey — from writing your first model handler to deploying on Kubernetes with monitoring, rate limiting, and cost controls. Every code block is production-tested. Every architecture decision is explained with the why, not just the what.

Let's ship something real.

01 What is LLMOps?

Think of it like running a restaurant. MLOps is the kitchen — you train models, validate them, and serve predictions. LLMOps is the entire restaurant operation: the kitchen, the waitstaff, the reservation system, the supply chain, the health inspections, and the accountant making sure you don't go bankrupt buying wagyu beef for every dish.

LLMs are fundamentally different from traditional ML models. They're massive, expensive, non-deterministic, and they talk back. That changes everything about how you operate them.

MLOps vs LLMOps

Dimension      | Traditional MLOps                         | LLMOps
Model Size     | MBs to low GBs                            | GBs to hundreds of GBs
Training       | Train from scratch, retrain often         | Fine-tune or prompt-engineer a foundation model
Inference Cost | Cheap (CPU often sufficient)              | Expensive (GPU required, cost per token)
Evaluation     | Accuracy, F1, AUC — well-defined metrics  | Vibes-based + human eval + LLM-as-judge
Versioning     | Model weights + data                      | Prompts + model version + RAG corpus + guardrails
Failure Modes  | Wrong prediction                          | Hallucination, prompt injection, toxic output, cost explosion
Scaling        | Horizontal, relatively simple             | GPU-bound, requires batching, quantization, caching

The LLMOps Lifecycle

┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│  SELECT   │ →  │  SERVE    │ →  │  MONITOR  │ →  │  OPTIMIZE │
│  MODEL    │    │  MODEL    │    │  & EVAL   │    │  & SCALE  │
└───────────┘    └───────────┘    └───────────┘    └───────────┘
  Fine-tune        FastAPI /        Latency,         Quantize,
  or use           vLLM /           Cost per         Batch,
  off-shelf        LitServe         request          Cache

The 2025 Reality Check

The LLMOps landscape has matured significantly. Here's what actually matters now:

  • Context Engineering > Prompt Engineering. It's no longer just about clever prompts. It's about building systems that feed the right context — RAG pipelines, tool outputs, memory — into the model at the right time. The prompt is the last mile, not the whole journey.
  • Guardrails are non-negotiable. Every production LLM deployment needs input validation, output filtering, and cost circuit breakers. Not as an afterthought — as core infrastructure. One prompt injection in production and your CEO is on the phone.
  • Software Engineering > ML Engineering. The bottleneck in LLMOps isn't model performance — it's building reliable systems around unreliable models. Retry logic, graceful degradation, queue management, observability. The boring stuff is what keeps you in production.
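The "cost circuit breaker" idea deserves a concrete shape. Here is a minimal in-memory sketch; the class name, window, and pricing are illustrative, and a real deployment would back this with Redis and wire it to actual token accounting:

```python
import time


class CostCircuitBreaker:
    """Illustrative sliding-window spend limiter for LLM calls.

    Tracks estimated spend over a window and refuses new requests
    once a budget is exceeded. In-memory and single-process only.
    """

    def __init__(self, budget_usd: float, window_seconds: float = 3600.0):
        self.budget_usd = budget_usd
        self.window_seconds = window_seconds
        self.spend: list = []  # (timestamp, usd) pairs

    def record(self, tokens: int, usd_per_1k_tokens: float) -> None:
        """Record the cost of a completed request."""
        self.spend.append((time.time(), tokens / 1000 * usd_per_1k_tokens))

    def allow(self) -> bool:
        """Is there budget left in the current window?"""
        cutoff = time.time() - self.window_seconds
        self.spend = [(t, c) for t, c in self.spend if t > cutoff]
        return sum(c for _, c in self.spend) < self.budget_usd


breaker = CostCircuitBreaker(budget_usd=1.0)
breaker.record(tokens=400_000, usd_per_1k_tokens=0.002)  # $0.80 spent
print(breaker.allow())  # True -- still under budget
breaker.record(tokens=200_000, usd_per_1k_tokens=0.002)  # now $1.20 total
print(breaker.allow())  # False -- breaker trips
```

Check `allow()` before every upstream call; when it trips, fail fast with a 429 or route to a cheaper model instead of silently burning budget.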

02 Setting Up Your Toolkit

Before we deploy anything, let's set up a clean environment with everything we'll need.

Bash setup.sh
# Create project directory
mkdir llmops-stack && cd llmops-stack

# Python environment (using uv for speed)
uv init
uv add fastapi uvicorn pillow torch diffusers transformers
uv add litserve aiohttp pydantic python-multipart
uv add prometheus-client structlog

# Docker (make sure you have Docker Desktop with GPU support)
docker --version   # Need 24.0+
nvidia-smi         # Verify GPU access

# Kubernetes tools (for later)
# brew install kubectl helm k9s

Here's the project structure we'll build throughout this guide:

llmops-stack/
├── models/
│   ├── sd_handler.py          # Stable Diffusion model handler
│   └── llm_handler.py         # LLM model handler
├── api/
│   ├── sd_server.py           # Stable Diffusion FastAPI server
│   └── llm_server.py          # LLM LitServe server
├── middleware/
│   ├── rate_limiter.py        # Rate limiting
│   ├── queue_handler.py       # Request queue
│   └── metrics.py             # Prometheus metrics
├── docker/
│   ├── Dockerfile.sd          # SD container
│   ├── Dockerfile.llm         # LLM container
│   └── docker-compose.yml     # Full stack orchestration
├── k8s/
│   └── sd-deployment.yaml     # Kubernetes manifests
└── tests/
    ├── test_sd_api.py
    └── test_llm_api.py

03 Deploy Stable Diffusion as an API

Let's start with something visual and satisfying — turning Stable Diffusion into a production API. This section covers the full journey: model handler, API server, containerization, and orchestration.

Step 1: The Model Handler

The model handler is responsible for loading the model into GPU memory and running inference. Every decision here has a production reason behind it.

Python sd_handler.py
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image
import io
import base64
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class StableDiffusionHandler:
    """Production-ready Stable Diffusion model handler.

    Design decisions:
    - Singleton pattern: Loading a 4GB model per request would be insane.
      We load once at startup and reuse across all requests.
    - float16 precision: Cuts VRAM usage in half with negligible quality loss.
      A100 users can try bfloat16 for slightly better quality.
    - Safety checker disabled: For internal/controlled deployments.
      Re-enable for public-facing APIs (legal liability is real).
    - attention_slicing: Trades ~10% speed for ~30% less VRAM.
      Essential on consumer GPUs (RTX 3090, 4090).
    """

    def __init__(
        self,
        model_id: str = "stabilityai/stable-diffusion-2-1",
        device: str = "cuda",
        dtype: torch.dtype = torch.float16,
    ):
        self.model_id = model_id
        self.device = device
        self.dtype = dtype
        self.pipeline: Optional[StableDiffusionPipeline] = None

    def load_model(self) -> None:
        """Load the model into GPU memory.

        This is called ONCE at server startup, not per request.
        On a T4 GPU, this takes ~15 seconds. On an A100, ~5 seconds.
        """
        logger.info(f"Loading model: {self.model_id}")
        logger.info(f"Device: {self.device}, Dtype: {self.dtype}")

        self.pipeline = StableDiffusionPipeline.from_pretrained(
            self.model_id,
            torch_dtype=self.dtype,
            # safety_checker=None,  # Uncomment for internal use only
        )
        self.pipeline = self.pipeline.to(self.device)

        # Memory optimization: essential on GPUs with <24GB VRAM
        self.pipeline.enable_attention_slicing()

        # Optional: even more VRAM savings at cost of speed
        # self.pipeline.enable_vae_slicing()
        # self.pipeline.enable_model_cpu_offload()

        logger.info("Model loaded successfully")

    def generate(
        self,
        prompt: str,
        negative_prompt: str = "blurry, low quality, distorted",
        num_inference_steps: int = 30,
        guidance_scale: float = 7.5,
        width: int = 512,
        height: int = 512,
    ) -> str:
        """Generate an image and return it as a base64 string.

        Why base64? Three reasons:
        1. No file system dependency -- works in containers with read-only fs
        2. Easy to embed in JSON API responses
        3. Avoids dealing with file cleanup and temp directories

        For production with high volume, consider streaming to S3/GCS instead.
        """
        if self.pipeline is None:
            raise RuntimeError("Model not loaded. Call load_model() first.")

        # Generate the image
        with torch.no_grad():  # Saves ~20% VRAM during inference
            result = self.pipeline(
                prompt=prompt,
                negative_prompt=negative_prompt,
                num_inference_steps=num_inference_steps,
                guidance_scale=guidance_scale,
                width=width,
                height=height,
            )

        image: Image.Image = result.images[0]

        # Convert to base64 for API response
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        img_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")

        return img_base64

    def health_check(self) -> dict:
        """Verify the model is loaded and GPU is accessible."""
        return {
            "status": "healthy" if self.pipeline else "not_loaded",
            "model": self.model_id,
            "device": self.device,
            "gpu_memory_allocated": (
                f"{torch.cuda.memory_allocated() / 1e9:.2f} GB"
                if torch.cuda.is_available() else "N/A"
            ),
        }
Key insight: The model handler is intentionally separate from the API server. This separation means you can swap out the model (SD 1.5 to SDXL to SD3) without touching your API code. It also makes testing trivial — mock the handler, test the API.

Step 2: The FastAPI Server

Now let's wrap our handler in a proper API server with health checks, request validation, and structured logging.

Python api_server.py
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import time
import logging
import torch  # needed for the GPU cleanup in lifespan()
from sd_handler import StableDiffusionHandler

logger = logging.getLogger(__name__)

# --- Model lifecycle management ---
# Why lifespan? Because we need the model loaded BEFORE any request
# hits the server. The old @app.on_event("startup") is deprecated.

handler = StableDiffusionHandler()

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load model at startup, clean up at shutdown."""
    handler.load_model()  # ~15 seconds on T4, blocks until ready
    yield
    # Cleanup: free GPU memory
    del handler.pipeline
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

app = FastAPI(
    title="Stable Diffusion API",
    version="1.0.0",
    lifespan=lifespan,
)


# --- Request/Response Models ---
# Pydantic handles validation so we don't have to.
# A 1024x1024 image with 50 steps will OOM on a T4 -- so we set limits.

class GenerateRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=500)
    negative_prompt: str = Field(
        default="blurry, low quality, distorted",
        max_length=500,
    )
    num_inference_steps: int = Field(default=30, ge=1, le=50)
    guidance_scale: float = Field(default=7.5, ge=1.0, le=20.0)
    # SD requires dimensions divisible by 8 -- enforce it at validation time
    width: int = Field(default=512, ge=256, le=768, multiple_of=8)
    height: int = Field(default=512, ge=256, le=768, multiple_of=8)


class GenerateResponse(BaseModel):
    image_base64: str
    generation_time: float
    parameters: dict


# --- Endpoints ---

@app.get("/health")
async def health():
    """Health check -- used by Docker, K8s, and load balancers.

    Why a dedicated endpoint? Because GET / returning 200 doesn't
    mean the model is loaded. Probes only look at the status code,
    so we return 503 until the pipeline is actually in memory.
    """
    status = handler.health_check()
    if status["status"] != "healthy":
        raise HTTPException(status_code=503, detail=status)
    return status


@app.post("/generate", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    """Generate an image from a text prompt.

    Why async def for a blocking operation? A blocking call inside an
    async endpoint blocks the event loop, which serializes requests --
    exactly what we want with one model on one GPU. A plain def would
    run in FastAPI's thread pool and allow concurrent calls into the
    pipeline, which risks CUDA OOM.
    """
    start_time = time.time()

    try:
        image_b64 = handler.generate(
            prompt=request.prompt,
            negative_prompt=request.negative_prompt,
            num_inference_steps=request.num_inference_steps,
            guidance_scale=request.guidance_scale,
            width=request.width,
            height=request.height,
        )
    except RuntimeError as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=503, detail="Model not ready")
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

    generation_time = time.time() - start_time

    return GenerateResponse(
        image_base64=image_b64,
        generation_time=round(generation_time, 2),
        parameters=request.model_dump(),
    )


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Step 3: Dockerize It

A model that only runs on your laptop isn't deployed. Let's containerize it with a production-optimized Dockerfile.

Dockerfile Dockerfile.sd
# Why this base image? It includes CUDA runtime, cuDNN, and Ubuntu 22.04.
# The "runtime" variant is ~3GB smaller than "devel" -- we don't need
# the CUDA compiler for inference.
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Prevent interactive prompts during apt-get install
ENV DEBIAN_FRONTEND=noninteractive

# Install Python 3.11 -- not 3.12, because torch wheels
# for CUDA 12.1 are most stable on 3.11 as of early 2026.
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-venv \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# --- Layer caching strategy ---
# Copy requirements FIRST. Docker caches layers, so if only your
# code changes (not dependencies), this layer is reused.
# This saves 10-15 minutes on rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Now copy the application code (changes more often)
COPY . .

# Pre-download the model at build time.
# Why? Because downloading 4GB at container start means 15 minutes
# before your first request. With this, the model is baked into the image.
# Tradeoff: larger image (~8GB), but instant startup.
RUN python3 -c "from diffusers import StableDiffusionPipeline; \
    StableDiffusionPipeline.from_pretrained('stabilityai/stable-diffusion-2-1')"

EXPOSE 8000

# Use uvicorn with 1 worker. Why just 1?
# Because each worker loads the full model into VRAM.
# 1 model = ~4GB VRAM. 2 workers = OOM on most GPUs.
# Scale horizontally (more containers) instead of vertically (more workers).
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

Step 4: Docker Compose

For local development and testing, Docker Compose orchestrates everything with one command.

YAML docker-compose.yml
services:
  stable-diffusion:
    build:
      context: .
      dockerfile: Dockerfile.sd
    ports:
      - "8000:8000"
    volumes:
      # Cache models on host so rebuilds don't re-download 4GB
      - model-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s    # Give the model time to load
    restart: unless-stopped
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - HF_HOME=/root/.cache/huggingface  # TRANSFORMERS_CACHE is deprecated

volumes:
  model-cache:

Build and Run

Bash terminal
# Build the image (first time takes ~20 minutes due to model download)
docker compose build

# Start the service
docker compose up -d

# Watch the logs (wait for "Model loaded successfully")
docker compose logs -f stable-diffusion

# Test it!
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a photo of an astronaut riding a horse on mars, cinematic lighting"}'

# Check GPU usage
nvidia-smi
Pro tip: The first build takes ages because it downloads the model. After that, Docker layer caching means rebuilds for code-only changes take under 30 seconds. This is why the COPY requirements.txt and COPY . . are on separate lines.

04 Deploy an LLM Chat API

Image generation is cool, but most production workloads are text. Let's deploy an LLM with an OpenAI-compatible API using LitServe.

Why LitServe?

You could use vLLM, TGI, or raw FastAPI. Here's why LitServe hits a sweet spot for learning:

  • Batteries included — batching, streaming, GPU management out of the box
  • OpenAI-compatible — your existing client code just works
  • Pythonic — no YAML configs, no custom serving formats, just a Python class
  • Production-tested — used by Lightning AI in production at scale
Python api_server.py
import litserve as ls
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class LLMEngine:
    """Handles model loading and raw inference.

    Separated from the serving layer so you can:
    - Test the engine independently
    - Swap models without changing the API
    - Reuse across different serving frameworks
    """

    def __init__(self, model_id: str, device: str = "cuda"):
        self.model_id = model_id
        self.device = device
        self.model = None
        self.tokenizer = None

    def load(self):
        """Load model and tokenizer into GPU memory."""
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_id,
            torch_dtype=torch.float16,
            device_map="auto",  # Automatically handles multi-GPU sharding
        )
        self.model.eval()  # Disable dropout -- we're doing inference, not training

    def generate(self, prompt: str, max_tokens: int = 512, temperature: float = 0.7) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id,
            )

        # Decode only the NEW tokens, not the input prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)


class LLMServingAPI(ls.LitAPI):
    """LitServe API wrapper.

    LitServe calls these methods in order:
    1. setup()     -- called once at startup (load model)
    2. decode()    -- parse incoming request
    3. predict()   -- run inference
    4. encode()    -- format response

    This structure forces clean separation of concerns.
    """

    def setup(self, device: str):
        """Called once when the server starts."""
        self.engine = LLMEngine(
            model_id="microsoft/phi-2",  # Small enough for a T4, smart enough to be useful
            device=device,
        )
        self.engine.load()

    def decode_request(self, request: dict) -> dict:
        """Parse the incoming request.

        Note: this takes only the LAST message -- no chat history and
        no chat template. Fine for a demo; a real chat server would
        apply tokenizer.apply_chat_template to the full message list.
        """
        return {
            "prompt": request["messages"][-1]["content"],
            "max_tokens": request.get("max_tokens", 512),
            "temperature": request.get("temperature", 0.7),
        }

    def predict(self, batch: list) -> list:
        """Run inference -- this is where the GPU does its thing.

        With max_batch_size > 1, LitServe hands predict() a LIST of
        decoded requests and expects a list back. We loop sequentially
        here; true batched generation (padding + one forward pass) is
        a further optimization.
        """
        return [
            self.engine.generate(
                prompt=inputs["prompt"],
                max_tokens=inputs["max_tokens"],
                temperature=inputs["temperature"],
            )
            for inputs in batch
        ]

    def encode_response(self, output: str) -> dict:
        """Format response as OpenAI-compatible JSON."""
        return {
            "choices": [{
                "message": {
                    "role": "assistant",
                    "content": output,
                }
            }]
        }


if __name__ == "__main__":
    api = LLMServingAPI()
    server = ls.LitServer(
        api,
        accelerator="gpu",
        devices=1,
        timeout=60,            # Kill requests that take >60s
        max_batch_size=4,     # Batch up to 4 requests together
        batch_timeout=0.05,   # Wait max 50ms to fill a batch
    )
    server.run(port=8001)

Client Code

LitServe speaks the OpenAI protocol when you construct the server with spec=ls.OpenAISpec(), which exposes /v1/chat/completions (without the spec, requests go to LitServe's default /predict endpoint). With that in place, you can use the official OpenAI Python client:

Python client.py
from openai import OpenAI

# Point to your local LLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="not-needed",  # LitServe doesn't require auth by default
)

response = client.chat.completions.create(
    model="phi-2",  # This is ignored by LitServe but required by the client
    messages=[
        {"role": "user", "content": "Explain Docker in 3 sentences."}
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)
Why this matters: By using an OpenAI-compatible interface, you can swap between your self-hosted model and OpenAI's API by changing a single line (the base_url). This is crucial for hybrid deployments where you route cheap queries to your local model and complex ones to GPT-4.
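That routing idea fits in a few lines. Everything below is illustrative -- the length heuristic, the endpoints, and the model names are placeholders for whatever routing signal you actually trust:

```python
from dataclasses import dataclass


@dataclass
class Backend:
    base_url: str
    model: str


# Hypothetical endpoints/models -- substitute your own.
LOCAL = Backend(base_url="http://localhost:8001/v1", model="phi-2")
HOSTED = Backend(base_url="https://api.openai.com/v1", model="gpt-4")


def pick_backend(prompt: str, max_local_chars: int = 500) -> Backend:
    """Toy routing heuristic: short, simple prompts go to the
    self-hosted model; long or multi-part ones go to the hosted API.
    Real routers use classifiers, cost budgets, and fallback chains."""
    looks_complex = len(prompt) > max_local_chars or prompt.count("\n") > 5
    return HOSTED if looks_complex else LOCAL


backend = pick_backend("Explain Docker in 3 sentences.")
print(backend.base_url)
# Then: client = OpenAI(base_url=backend.base_url, ...) -- one client, two worlds.
```

Because both backends share the OpenAI wire format, the router only has to pick a base_url and model; the rest of the client code is identical.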

LLM Dockerfile

Dockerfile Dockerfile.llm
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    python3.11 python3.11-venv python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements-llm.txt .
RUN pip install --no-cache-dir -r requirements-llm.txt

COPY . .

# Pre-download model weights at build time
RUN python3 -c "from transformers import AutoModelForCausalLM, AutoTokenizer; \
    AutoTokenizer.from_pretrained('microsoft/phi-2'); \
    AutoModelForCausalLM.from_pretrained('microsoft/phi-2')"

EXPOSE 8001

CMD ["python3", "api_server.py"]

05 Production Hardening

You have a working API. Now let's make it survive the real world. Production means handling abusive clients, managing request queues when the GPU is saturated, and knowing when things go wrong before your users tell you.

Rate Limiting

Without rate limiting, one client with a while True loop will DoS your GPU and nobody else gets served. This middleware uses a simple sliding window counter per client IP.

Python middleware.py
import time
from collections import defaultdict
from fastapi import Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware


class RateLimitMiddleware(BaseHTTPMiddleware):
    """Sliding window rate limiter per client IP.

    Why not use a library? Because understanding rate limiting is
    essential for LLMOps. GPU inference is SLOW (2-30 seconds per
    request) -- you can't handle 1000 req/s like a REST API.

    For production at scale, use Redis-backed rate limiting.
    This in-memory version works for single-instance deployments.
    """

    def __init__(self, app, requests_per_minute: int = 10):
        super().__init__(app)
        self.requests_per_minute = requests_per_minute
        self.clients: dict[str, list[float]] = defaultdict(list)

    async def dispatch(self, request: Request, call_next):
        # Get client IP (handle proxies)
        client_ip = request.client.host
        forwarded = request.headers.get("X-Forwarded-For")
        if forwarded:
            client_ip = forwarded.split(",")[0].strip()

        now = time.time()
        window_start = now - 60  # 1-minute sliding window

        # Remove timestamps outside the window
        self.clients[client_ip] = [
            t for t in self.clients[client_ip]
            if t > window_start
        ]

        # Over the limit? Return the response directly. An HTTPException
        # raised inside BaseHTTPMiddleware bypasses FastAPI's exception
        # handlers and surfaces as a 500, not a 429.
        if len(self.clients[client_ip]) >= self.requests_per_minute:
            return JSONResponse(
                status_code=429,
                content={"detail": "Rate limit exceeded. Try again in 60 seconds."},
                headers={"Retry-After": "60"},
            )

        # Record this request
        self.clients[client_ip].append(now)

        return await call_next(request)

Request Queue

GPU inference is inherently serial (one model, one GPU, one request at a time). When requests arrive faster than the GPU can process them, you need a queue. Without one, requests pile up, timeouts cascade, and everything falls over.

Python queue_handler.py
import asyncio
import time
from dataclasses import dataclass, field
from typing import Any
import logging

logger = logging.getLogger(__name__)


@dataclass
class InferenceRequest:
    """A single queued inference request."""
    request_id: str
    payload: dict
    future: asyncio.Future
    enqueued_at: float = field(default_factory=time.time)


class InferenceQueue:
    """Async queue for GPU inference requests.

    Why a custom queue instead of just asyncio.Queue?
    - Position tracking: clients can poll "where am I in line?"
    - Timeout handling: stale requests get dropped, not processed
    - Backpressure: reject new requests when the queue is full
      instead of accepting them and timing out later (which wastes
      everyone's time)
    """

    def __init__(self, max_size: int = 50, request_timeout: float = 120.0):
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_size)
        self.request_timeout = request_timeout
        self.active_requests: dict[str, InferenceRequest] = {}

    async def enqueue(self, request_id: str, payload: dict) -> Any:
        """Add a request to the queue and wait for the result."""
        if self.queue.full():
            raise RuntimeError("Queue is full. Try again later.")

        future = asyncio.get_running_loop().create_future()  # get_event_loop() is deprecated here
        req = InferenceRequest(
            request_id=request_id,
            payload=payload,
            future=future,
        )

        self.active_requests[request_id] = req
        await self.queue.put(req)

        logger.info(f"Enqueued {request_id}, queue size: {self.queue.qsize()}")

        try:
            result = await asyncio.wait_for(future, timeout=self.request_timeout)
            return result
        except asyncio.TimeoutError:
            logger.warning(f"Request {request_id} timed out")
            raise RuntimeError("Request timed out")
        finally:
            self.active_requests.pop(request_id, None)

    async def process_loop(self, handler_func):
        """Worker loop that processes requests one at a time.

        Why sequential? Because the GPU can only run one inference
        at a time (unless you're doing continuous batching, which
        is a whole other level of complexity).
        """
        while True:
            req = await self.queue.get()

            # Skip if the client already disconnected
            if req.future.done():
                continue

            try:
                result = await handler_func(req.payload)
                if not req.future.done():  # client may have timed out mid-inference
                    req.future.set_result(result)
            except Exception as e:
                if not req.future.done():
                    req.future.set_exception(e)

    def get_position(self, request_id: str) -> int:
        """Approximate queue position.

        asyncio.Queue can't report a specific request's slot, so we
        return the current depth -- an upper bound on how many
        requests are ahead of you.
        """
        return self.queue.qsize()
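The whole pattern fits in a self-contained miniature: callers park on a Future, one worker drains the queue and resolves each Future in turn. This is the same mechanics as InferenceQueue above, stripped of timeouts and backpressure:

```python
import asyncio


async def demo() -> list:
    """Future-based queue pattern in miniature (illustrative)."""
    queue: asyncio.Queue = asyncio.Queue()

    async def worker() -> None:
        # Sequential worker: one "inference" at a time, like one GPU.
        while True:
            payload, future = await queue.get()
            await asyncio.sleep(0)  # stand-in for a slow GPU inference call
            future.set_result(payload.upper())

    async def submit(payload: str) -> str:
        # Caller enqueues, then awaits its own Future for the result.
        future = asyncio.get_running_loop().create_future()
        await queue.put((payload, future))
        return await asyncio.wait_for(future, timeout=5.0)

    worker_task = asyncio.create_task(worker())
    try:
        return await asyncio.gather(submit("first"), submit("second"))
    finally:
        worker_task.cancel()


results = asyncio.run(demo())
print(results)  # processed strictly in arrival order
```

Concurrent callers, serial processing: exactly the shape a single-GPU model server needs.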

Monitoring

If you can't measure it, you can't improve it. Prometheus metrics give you visibility into exactly what your model server is doing.

Python metrics.py
from prometheus_client import (
    Counter, Histogram, Gauge, CollectorRegistry, generate_latest
)
import time
import functools


class ServerMetrics:
    """Prometheus metrics for model serving.

    These four metrics tell you everything:
    - request_count: Are people using this? Is traffic growing?
    - request_latency: How fast is inference? Are we degrading?
    - gpu_utilization: Are we wasting money on idle GPUs?
    - queue_size: Are requests backing up?
    """

    def __init__(self):
        self.registry = CollectorRegistry()

        self.request_count = Counter(
            "inference_requests_total",
            "Total number of inference requests",
            ["model", "status"],
            registry=self.registry,
        )

        self.request_latency = Histogram(
            "inference_latency_seconds",
            "Inference latency in seconds",
            ["model"],
            # Custom buckets tuned for GPU inference (not web requests)
            buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0],
            registry=self.registry,
        )

        self.gpu_utilization = Gauge(
            "gpu_utilization_percent",
            "GPU utilization percentage",
            registry=self.registry,
        )

        self.queue_size = Gauge(
            "inference_queue_size",
            "Current number of requests in the queue",
            registry=self.registry,
        )

    def track_request(self, model_name: str):
        """Decorator to track request count and latency."""
        def decorator(func):
            @functools.wraps(func)
            async def wrapper(*args, **kwargs):
                start = time.time()
                try:
                    result = await func(*args, **kwargs)
                    self.request_count.labels(
                        model=model_name, status="success"
                    ).inc()
                    return result
                except Exception as e:
                    self.request_count.labels(
                        model=model_name, status="error"
                    ).inc()
                    raise
                finally:
                    self.request_latency.labels(
                        model=model_name
                    ).observe(time.time() - start)
            return wrapper
        return decorator

    def get_metrics(self) -> bytes:
        """Export metrics in Prometheus format."""
        return generate_latest(self.registry)
The monitoring stack: Prometheus scrapes the /metrics endpoint every 15s, Grafana visualizes it, and Alertmanager pages you at 3 AM when latency spikes. Set up alerts for: p99 latency > 30s, error rate > 5%, queue size > 20, GPU utilization < 10% (you're wasting money).
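Those four thresholds read naturally as an executable checklist. A sketch only -- in practice these conditions belong in Prometheus alerting rules, not application code, and the function name here is made up:

```python
def firing_alerts(p99_latency_s: float, error_rate: float,
                  queue_size: int, gpu_util_pct: float) -> list:
    """Evaluate the four alert conditions from the callout above."""
    alerts = []
    if p99_latency_s > 30:
        alerts.append("p99 latency > 30s")
    if error_rate > 0.05:
        alerts.append("error rate > 5%")
    if queue_size > 20:
        alerts.append("queue size > 20")
    if gpu_util_pct < 10:
        alerts.append("GPU utilization < 10% (wasting money)")
    return alerts


print(firing_alerts(p99_latency_s=42.0, error_rate=0.01,
                    queue_size=3, gpu_util_pct=85.0))
```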

06 Deploying to Kubernetes

Docker Compose works on a single machine. When you need multiple replicas, rolling updates, and automatic recovery, it's time for Kubernetes. Here's a production-ready deployment manifest with every decision explained.

YAML sd-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable-diffusion
  labels:
    app: stable-diffusion
spec:
  replicas: 2                    # 2 GPUs = 2 replicas. Scale by adding GPUs.
  selector:
    matchLabels:
      app: stable-diffusion
  template:
    metadata:
      labels:
        app: stable-diffusion
    spec:
      containers:
        - name: sd-api
          image: your-registry/sd-api:latest
          ports:
            - containerPort: 8000

          # --- Probes ---
          # readinessProbe: "Is this pod ready to receive traffic?"
          # The model takes ~15s to load. Without this probe, K8s would
          # send traffic to a pod that hasn't loaded the model yet = 503 errors.
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30    # Give the model time to load
            periodSeconds: 10
            failureThreshold: 3

          # livenessProbe: "Is this pod still alive?"
          # If inference hangs (GPU deadlock, OOM), K8s will restart the pod.
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
            failureThreshold: 3

          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: "1"    # Request exactly 1 GPU
            limits:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"    # Limit to 1 GPU (don't be greedy)

          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface

          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"    # Safe to hardcode: the one GPU the device plugin allocates always shows up as device 0 inside the container

      # --- Model Caching ---
      # PersistentVolumeClaim so models survive pod restarts.
      # Without this, every pod restart re-downloads 4GB. With it, instant startup.
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

      # Schedule only on GPU nodes
      nodeSelector:
        gpu: "true"

      # Tolerate the GPU node taint
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

---
# PVC for model caching
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    # ReadWriteMany so both replicas can mount the cache even when they land on
    # different nodes. With ReadWriteOnce, the second pod would sit Pending.
    # RWX requires a storage class that supports it (NFS, EFS, GCP Filestore).
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi              # Enough for SD + LLM model weights

---
# Service to expose the deployment
apiVersion: v1
kind: Service
metadata:
  name: stable-diffusion-service
spec:
  selector:
    app: stable-diffusion
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
Scaling strategy: Don't try to autoscale GPU workloads the same way you autoscale web servers. GPU nodes take 5-10 minutes to provision. Instead, use queue-based scaling: monitor queue depth, and if it stays above 10 for more than 2 minutes, spin up a new replica. Scale down when idle for 15+ minutes to save costs.
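The scale-up/scale-down rule above can be sketched as a small decision helper. The thresholds are the ones from the paragraph (depth above 10 sustained for 2 minutes scales up, 15 idle minutes scales down); the actual replica change — patching the Deployment via the Kubernetes API or `kubectl scale` — is left out, and the class itself is an illustration, not a library API.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_WATER = 10          # queue depth that signals overload
SCALE_UP_SECS = 120      # depth must stay high this long before adding a replica
SCALE_DOWN_SECS = 900    # queue must stay empty this long before removing one

@dataclass
class QueueScaler:
    """Turns queue-depth samples into scale decisions: +1, -1, or 0 (hold)."""
    high_since: Optional[float] = None   # when the queue first went over HIGH_WATER
    idle_since: Optional[float] = None   # when the queue first went empty

    def decide(self, ts: float, depth: int) -> int:
        if depth > HIGH_WATER:
            self.idle_since = None
            if self.high_since is None:
                self.high_since = ts
            if ts - self.high_since >= SCALE_UP_SECS:
                self.high_since = None   # reset: add one replica at a time
                return +1
        elif depth == 0:
            self.high_since = None
            if self.idle_since is None:
                self.idle_since = ts
            if ts - self.idle_since >= SCALE_DOWN_SECS:
                self.idle_since = None
                return -1
        else:
            # Moderate load: neither overloaded nor idle — reset both timers.
            self.high_since = None
            self.idle_since = None
        return 0
```

Feed it a sample every scrape interval (e.g. queue depth from your Prometheus metrics) and apply the returned delta, clamped to your GPU count. Tools like KEDA implement the same idea declaratively if you'd rather not run the loop yourself.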

07 The Learning Roadmap

LLMOps is a deep field. Here's a structured 8-week path from "I have Docker installed" to "I run production AI infrastructure."

Phase 1: Foundations (Weeks 1-2)

Goal: Get comfortable with the tools.

  • Docker fundamentals: build, run, volume mounts, multi-stage builds
  • GPU in Docker: install nvidia-container-toolkit, run nvidia-smi inside a container
  • FastAPI: build a simple REST API, understand async/await, add Pydantic validation
  • Milestone: Run a PyTorch model inside a Docker container and hit it with curl

Phase 2: Model Serving (Weeks 3-4)

Goal: Deploy real models that people can use.

  • Deploy an image classifier (ResNet, easy win)
  • Deploy Stable Diffusion (follow Section 3 of this guide)
  • Deploy an LLM with LitServe (follow Section 4)
  • Benchmark: measure latency, throughput, VRAM usage
  • Milestone: Two working APIs (image gen + text gen) with health checks

Phase 3: Production Engineering (Weeks 5-6)

Goal: Make it survive real-world conditions.

  • Add monitoring (Prometheus + Grafana)
  • Set up CI/CD (GitHub Actions → build image → deploy)
  • Kubernetes basics: deployments, services, ingress
  • Load testing: use Locust to find your breaking point
  • Milestone: A fully monitored deployment with automated rollouts

Phase 4: Advanced LLMOps (Weeks 7-8)

Goal: Operate AI systems at scale.

  • Model versioning: A/B test different models behind the same API
  • Prompt versioning: track prompt changes like code changes
  • Cost optimization: quantization, caching, spot instances
  • Guardrails: input validation, output filtering, circuit breakers
  • Milestone: A/B testing two models with cost tracking and guardrails

Phase 5: Capstone Project

Goal: Build something end-to-end that demonstrates your skills.

Capstone idea: Build a multi-model serving platform that routes requests to different models based on complexity. Simple questions go to a small, fast model (Phi-2). Complex questions go to a large model (Llama-3). Image requests go to Stable Diffusion. Add cost tracking per user, rate limiting, and a simple dashboard. This single project demonstrates every skill in the LLMOps toolkit.
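The routing core of that capstone fits in one function. This is a deliberately crude sketch — the keyword heuristic and the backend names are illustrative assumptions; a real router might use prompt token counts or a small classifier instead.

```python
def route(request: dict) -> str:
    """Pick a backend for a request dict like {"type": "text", "prompt": "..."}."""
    if request.get("type") == "image":
        return "stable-diffusion"        # all image prompts go to SD
    prompt = request.get("prompt", "")
    # Crude complexity heuristic: long prompts or reasoning-style keywords
    # go to the big model; everything else takes the cheap, fast path.
    reasoning_markers = ("explain", "analyze", "compare", "step by step")
    complex_request = (len(prompt.split()) > 50
                       or any(m in prompt.lower() for m in reasoning_markers))
    return "llama-3" if complex_request else "phi-2"
```

The payoff is that every later feature (cost tracking, rate limits, the dashboard) hangs off the backend name this function returns, so you can A/B new routing logic without touching the serving layer.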

08 Cost Optimization

GPU compute is expensive. A single A100 on AWS costs ~$3/hour. Running it 24/7 for a month is $2,160. Here's how to cut that by 60-80% without sacrificing quality.
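That arithmetic is worth keeping as a one-liner so you can compare the strategies below on your own rates — the function and its defaults are just the article's numbers, not a billing API:

```python
def monthly_gpu_cost(rate_per_hour: float, hours_per_day: float = 24.0,
                     days: int = 30, discount: float = 0.0) -> float:
    """Monthly cost of one GPU. `discount` models spot pricing or idle shutdown."""
    return rate_per_hour * hours_per_day * days * (1.0 - discount)

print(monthly_gpu_cost(3.0))                     # A100 on-demand, 24/7: 2160.0
print(monthly_gpu_cost(3.0, discount=0.7))       # spot at ~70% off: 648.0
print(monthly_gpu_cost(3.0, hours_per_day=12))   # scaled to zero half the day: 1080.0
```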

Strategy | Impact | Complexity
Model Quantization (INT8/INT4) | 40-60% VRAM reduction; run on smaller, cheaper GPUs | Low — tools like bitsandbytes make it a one-line change
Response Caching (Redis) | 30-70% fewer inference calls for repeated queries | Low — hash the prompt, cache the response
Spot/Preemptible Instances | 60-80% cheaper than on-demand GPU instances | Medium — needs graceful handling of interruptions
Auto-scaling to Zero | 90%+ savings during off-peak hours | Medium — cold-start latency is the tradeoff
Model Distillation | Use a smaller model that mimics the large one | High — requires training data and evaluation
Request Batching | 2-4x throughput improvement on the same GPU | Medium — need to handle variable latency
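Of these, response caching is the cheapest win, and the load-bearing part is the cache key: it must be deterministic across dict ordering and trivial prompt variations. A minimal sketch — the Redis calls are shown only as comments (redis-py assumed), and note that caching only makes sense for deterministic settings (temperature 0) or where serving a repeated answer is acceptable:

```python
import hashlib
import json

def cache_key(prompt: str, params: dict) -> str:
    """Deterministic key: same prompt + same generation params -> same key.
    sort_keys makes the JSON stable regardless of dict insertion order."""
    payload = json.dumps({"prompt": prompt.strip().lower(), "params": params},
                         sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Sketch of the handler logic:
#   key = cache_key(req.prompt, {"temperature": req.temperature})
#   if (hit := redis.get(key)) is not None:
#       return hit                      # cache hit — zero GPU time spent
#   result = generate(req)              # the expensive call
#   redis.setex(key, 3600, result)      # expire after an hour
```

Include every parameter that changes the output (temperature, max tokens, model version) in `params` — otherwise a cached response from one configuration leaks into another.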

Quick Reference

Essential commands you'll use daily when operating model servers:

Bash commands.sh
# Monitor GPU usage in real-time
watch -n 1 nvidia-smi

# Check Docker container GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Build and push to registry
docker build -t your-registry/sd-api:v1.2 -f Dockerfile.sd .
docker push your-registry/sd-api:v1.2

# K8s: rolling update with zero downtime
kubectl set image deployment/stable-diffusion sd-api=your-registry/sd-api:v1.2

# K8s: check pod status and GPU allocation
kubectl get pods -l app=stable-diffusion -o wide
kubectl describe node | grep -A 5 "nvidia.com/gpu"

# Load test with curl (quick and dirty)
for i in $(seq 1 10); do
  curl -s -X POST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "a test image"}' &
done
wait

# Check Prometheus metrics
curl -s http://localhost:8000/metrics | grep inference_latency

Wrapping Up

We've covered a lot of ground: from a bare Python model handler to a production Kubernetes deployment with monitoring, rate limiting, and cost controls. But here's what I want you to take away:

LLMOps is not about memorizing YAML files. It's about understanding the why behind every decision. Why do we use a lifespan handler? Because the model must be loaded before serving traffic. Why rate limiting? Because GPU inference is slow and expensive. Why Prometheus? Because you can't optimize what you can't measure.

Every code block in this guide has comments explaining the reasoning. When you understand the reasoning, you can adapt to any tool, any cloud provider, any model.

Start with Section 3. Get a Stable Diffusion API running locally. Then work your way through the rest. The roadmap in Section 7 will keep you on track.

LLMOps = Model Serving + Infrastructure + Monitoring + Cost Control + Guardrails