Building a Multimodal Vision-Language Model from Scratch
Gemma-270M + CLIP Vision = A tiny model that sees and speaks
Why This Blog Exists
Most tutorials on vision-language models start with “pip install transformers” and end with calling a pretrained model. That’s fine for applications — but it teaches you nothing about how these systems actually work.
This blog documents how I built a complete multimodal vision-language model from scratch — combining Google’s Gemma-270M language model with OpenAI’s CLIP vision encoder, training it on 157K image-text pairs from the LLaVA dataset, and deploying it to HuggingFace Spaces with a full CI/CD pipeline. The entire project is open-source at github.com/sagar431/multimodal-gemma-270m.
The key insight: you don’t need a 70B parameter model to understand vision-language architectures. A 270M parameter model trained on a single A100 for 9 hours can teach you everything about how images and text are fused, how projector layers bridge modalities, and why LoRA makes fine-tuning practical.
01 The Architecture — How Images and Text Become One
The Core Idea
A multimodal model needs to solve one fundamental problem: images live in pixel space (continuous, spatial, high-dimensional), while text lives in token space (discrete, sequential, vocabulary-indexed). How do you get them to talk to each other?
The answer, pioneered by LLaVA (Liu et al., 2023), is elegantly simple:
- Use a pretrained vision encoder (CLIP) to convert images into a sequence of feature vectors
- Use a projection layer (MLP) to map those vision features into the language model’s embedding space
- Interleave the projected image tokens with text tokens
- Let the language model process everything as if it were all text
Image (224×224×3)
│
▾
CLIP ViT-Large/14 (frozen, 428M params)
│
▾
Patch Tokens: [batch, 256, 1024]
│
▾
Vision Projector MLP (trainable)
│
▾
Projected Tokens: [batch, 256, 1536]
│
├—— merged with ——▸ Text Tokens: [batch, seq_len, 1536]
│
▾
Gemma-270M Language Model (LoRA adapters, trainable)
│
▾
Generated Text Response
Why These Specific Models?
| Component | Model | Parameters | Why This One |
|---|---|---|---|
| Vision Encoder | CLIP ViT-Large/14 | 428M (frozen) | Best open vision encoder, trained on 400M image-text pairs |
| Language Model | Gemma-270M | 270M (LoRA) | Small enough for single-GPU, powerful enough to generate coherent text |
| Projector | 2-layer MLP | ~6M (trainable) | Simple, proven in LLaVA, fast to train |
| Total | — | 539M | Only 18.6M trainable (3.4%) |
The Vision Projector — Bridging Two Worlds
The projector is deceptively important. It’s “just” a 2-layer MLP, but it’s doing the critical job of translating CLIP’s visual language into Gemma’s text language.
class VisionProjector(nn.Module):
"""Projects vision features to language model embedding space.
Uses a 2-layer MLP with GELU activation following the LLaVA architecture."""
def __init__(self, vision_dim, language_dim, hidden_dim=None, dropout=0.0):
super().__init__()
hidden_dim = hidden_dim or (language_dim * 2)
self.fc1 = nn.Linear(vision_dim, hidden_dim)
self.act = nn.GELU()
self.fc2 = nn.Linear(hidden_dim, language_dim)
self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
self._init_weights()
def _init_weights(self):
for module in [self.fc1, self.fc2]:
if isinstance(module, nn.Linear):
nn.init.xavier_uniform_(module.weight)
if module.bias is not None:
nn.init.zeros_(module.bias)
def forward(self, vision_features):
x = self.fc1(vision_features)
x = self.act(x)
x = self.dropout(x)
x = self.fc2(x)
return x
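To make the sizes concrete, here is the projector's parameter count as plain arithmetic. This is a sketch using the defaults shown above (hidden_dim = 2 × language_dim); the trained run may have used a smaller hidden_dim, which would explain the ~6M figure in the table:

```python
def projector_param_count(vision_dim, language_dim, hidden_dim=None):
    """Weights + biases of the 2-layer MLP projector."""
    hidden_dim = hidden_dim or language_dim * 2
    fc1 = vision_dim * hidden_dim + hidden_dim      # Linear(vision_dim -> hidden_dim)
    fc2 = hidden_dim * language_dim + language_dim  # Linear(hidden_dim -> language_dim)
    return fc1 + fc2

# CLIP ViT-L/14 features (1024) -> Gemma embedding space (1536)
print(projector_param_count(1024, 1536))  # 7868928 with the default hidden_dim of 3072
```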
02 Building the Multimodal Model
The MultimodalGemma Class
This is the core of the entire system. It wires together the vision encoder, projector, and language model into a single differentiable pipeline.
class MultimodalGemma(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self._setup_tokenizer()
self._setup_language_model()
self._setup_vision_components()
self._setup_projectors()
self._freeze_encoders()
self._setup_lora()
Each setup method handles one component. Let me walk through the critical ones:
Setting Up the Vision Encoder
def _setup_vision_components(self):
vision_model_name = self.config["model"]["vision_model_name"]
self.vision_encoder = CLIPVisionModel.from_pretrained(
vision_model_name,
torch_dtype=torch.bfloat16
)
self.vision_processor = CLIPProcessor.from_pretrained(vision_model_name)
CLIP ViT-Large/14 splits a 224×224 image into a 16×16 grid of patches (14×14 pixels each), producing 256 patch tokens + 1 CLS token. We use bfloat16 to cut memory in half without meaningful precision loss.
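The patch arithmetic is worth checking once by hand:

```python
image_size, patch_size = 224, 14       # CLIP ViT-Large/14 input and patch size
grid = image_size // patch_size        # 16 patches per side
num_patches = grid * grid              # 256 patch tokens
num_tokens = num_patches + 1           # plus the CLS token the ViT prepends
print(grid, num_patches, num_tokens)   # 16 256 257
```

We drop the CLS token later in `encode_images`, which is why 256 (not 257) vision tokens reach the language model.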
LoRA — Making Fine-Tuning Practical
Full fine-tuning of even a 270M model would update all 270M parameters. With LoRA (Low-Rank Adaptation), we inject small trainable matrices into the attention layers:
def _setup_lora(self):
lora_config = LoraConfig(
r=64, # rank of the low-rank matrices
lora_alpha=128, # scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.1,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
self.language_model = get_peft_model(self.language_model, lora_config)
self.language_model.print_trainable_parameters()
# Output: trainable params: 12,582,912 || all params: 270,000,000 || trainable%: 4.66%
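The trainable-parameter count follows directly from the LoRA construction: each adapted weight matrix W (d_out × d_in) gets two low-rank factors, A (r × d_in) and B (d_out × r), adding r × (d_in + d_out) parameters. A quick sketch (the 640-dim projection here is illustrative, not Gemma-270M's actual attention shapes):

```python
def lora_params_per_matrix(d_in, d_out, r):
    """Extra trainable params from LoRA-adapting one linear layer."""
    return r * (d_in + d_out)  # A: (r, d_in) plus B: (d_out, r)

# Illustrative: a square 640-dim projection at rank 64
print(lora_params_per_matrix(640, 640, r=64))  # 81920 per adapted matrix

# The LoRA update is scaled by alpha / r
scaling = 128 / 64
print(scaling)  # 2.0
```

Summed over q/k/v/o projections in every layer, this is how you land in the low tens of millions of trainable parameters instead of 270M.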
The Image-Text Merge — The Heart of Multimodality
This is the most subtle and important function in the entire codebase. When the model receives <image> What is this?, it needs to:
- Convert <image> into 256 actual vision tokens
- Expand the sequence length accordingly
- Update attention masks and labels to match
def encode_images(self, images):
with torch.no_grad():
vision_outputs = self.vision_encoder(pixel_values=images)
# Use ALL patch tokens (skip CLS at index 0)
image_features = vision_outputs.last_hidden_state[:, 1:, :]
projected_features = self.vision_projector(image_features)
return projected_features
def _merge_image_features(self, input_ids, image_features, attention_mask, labels):
batch_size, seq_len = input_ids.shape
num_image_tokens = image_features.shape[1] # 256 patches
device = input_ids.device
text_embeds = self.language_model.get_input_embeddings()(input_ids)
image_token_mask = (input_ids == self.image_token_id)
new_seq_len = seq_len - 1 + num_image_tokens # replace 1 token with 256
new_embeds = torch.zeros(batch_size, new_seq_len, text_embeds.shape[-1],
dtype=text_embeds.dtype, device=device)
new_attention_mask = torch.zeros(batch_size, new_seq_len,
dtype=attention_mask.dtype, device=device)
new_labels = torch.full((batch_size, new_seq_len), -100,
dtype=labels.dtype, device=device) if labels is not None else None
for batch_idx in range(batch_size):
image_positions = torch.where(image_token_mask[batch_idx])[0]
if len(image_positions) > 0:
img_pos = image_positions[0].item()
# Copy text before <image>
new_embeds[batch_idx, :img_pos] = text_embeds[batch_idx, :img_pos]
# Insert all 256 image tokens
new_embeds[batch_idx, img_pos:img_pos + num_image_tokens] = image_features[batch_idx]
# Copy text after <image>
remaining_len = seq_len - img_pos - 1
new_embeds[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
text_embeds[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]
# Update attention mask
new_attention_mask[batch_idx, :img_pos] = attention_mask[batch_idx, :img_pos]
new_attention_mask[batch_idx, img_pos:img_pos + num_image_tokens] = 1
new_attention_mask[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
attention_mask[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]
# Labels: mask image positions with -100 (ignore in loss)
if labels is not None:
new_labels[batch_idx, :img_pos] = labels[batch_idx, :img_pos]
new_labels[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
labels[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]
return new_embeds, new_attention_mask, new_labels
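The tensor bookkeeping above is easier to see on plain Python lists. This toy version (a hypothetical helper, with strings standing in for embeddings) reproduces the length math and the -100 label masking for a single sequence:

```python
IMAGE_TOKEN = "<image>"

def merge_image_tokens(tokens, patch_embeds):
    """Replace the single <image> token with all patch embeddings."""
    pos = tokens.index(IMAGE_TOKEN)
    merged = tokens[:pos] + patch_embeds + tokens[pos + 1:]
    # Image positions are masked with -100 so the loss ignores them;
    # text labels carry over unchanged (the real code also rebuilds masks).
    labels = tokens[:pos] + [-100] * len(patch_embeds) + tokens[pos + 1:]
    return merged, labels

tokens = ["<bos>", "<image>", "What", "is", "this", "?"]
patches = [f"patch_{i}" for i in range(256)]
merged, labels = merge_image_tokens(tokens, patches)
print(len(merged))  # 261 == seq_len - 1 + num_image_tokens
```

Same invariant as the real function: one `<image>` placeholder out, 256 vision tokens in, everything after it shifted right.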
03 The Training Pipeline with PyTorch Lightning
Why Lightning?
PyTorch Lightning removes the boilerplate from training: gradient accumulation, mixed precision, multi-GPU, logging, checkpointing — all handled by the framework. You focus on the model, not the training loop.
The Lightning Module
class MultimodalGemmaLightning(L.LightningModule):
def __init__(self, config):
super().__init__()
self.save_hyperparameters()
self.config = config
self.model = MultimodalGemma(config)
self.training_step_outputs = []
self.validation_step_outputs = []
def forward(self, batch):
return self.model(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
images=batch.get("images"),
labels=batch["labels"]
)
def training_step(self, batch, batch_idx):
outputs = self(batch)
loss = outputs["loss"]
self.log("train/loss", loss, on_step=True, on_epoch=True, prog_bar=True, sync_dist=True)
self.log("train/learning_rate", self.optimizers().param_groups[0]["lr"], on_step=True)
self.training_step_outputs.append(loss.detach())
return loss
def validation_step(self, batch, batch_idx):
outputs = self(batch)
loss = outputs["loss"]
self.log("val/loss", loss, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)
self.validation_step_outputs.append(loss.detach())
return loss
Optimizer Configuration — Differential Learning Rates
Different parts of the model need different learning rates. The projector is learning from scratch, but the LoRA adapters are fine-tuning a pretrained model:
def configure_optimizers(self):
param_groups = []
# Vision projector — higher LR (learning from scratch)
vision_proj_params = list(self.model.vision_projector.parameters())
param_groups.append({
"params": vision_proj_params,
"lr": float(self.config["training"]["projector_lr"]), # 1e-3
"name": "vision_projector"
})
# LoRA adapters — lower LR (fine-tuning)
lora_params = [p for n, p in self.model.language_model.named_parameters() if p.requires_grad]
param_groups.append({
"params": lora_params,
"lr": float(self.config["training"]["lora_lr"]), # 2e-4
"name": "lora_adapters"
})
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01, eps=1e-8)
# Linear warmup then decay
        accumulate = self.config["training"].get("accumulate_grad_batches", 2)
        max_epochs = self.config["training"]["max_epochs"]
        total_steps = (len(self.trainer.datamodule.train_dataloader()) // accumulate) * max_epochs
warmup_steps = int(total_steps * 0.03)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}
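For reference, the multiplier that get_linear_schedule_with_warmup applies to each group's base LR looks like this (a pure-Python sketch of the HF schedule):

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warmup to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total, warmup = 10_000, 300  # e.g. 3% warmup
print(lr_multiplier(150, warmup, total))     # 0.5  halfway through warmup
print(lr_multiplier(300, warmup, total))     # 1.0  warmup done, full LR
print(lr_multiplier(10_000, warmup, total))  # 0.0  end of training
```

Because the multiplier is shared, the projector and LoRA groups keep their 5× ratio (1e-3 vs 2e-4) throughout training.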
04 Data — The LLaVA Dataset
Dataset Structure
We train on the LLaVA-Instruct-150K dataset: 157,712 image-text conversation pairs built from COCO images. Each sample has:
- An image (from MS-COCO)
- A multi-turn conversation (Human asks, Assistant responds)
- Visual grounding (the conversation references specific image content)
The Data Pipeline
class LLaVADataModule(L.LightningDataModule):
def __init__(self, tokenizer, vision_processor, config):
super().__init__()
self.tokenizer = tokenizer
self.vision_processor = vision_processor
        self.config = config  # setup() builds the dataset from this
        self.batch_size = config["training"]["batch_size"]
self.num_workers = config["data"].get("num_workers", 4)
self.val_size = config["data"].get("val_size", 0.02)
self.collator = MultimodalCollator(
tokenizer=self.tokenizer,
vision_processor=self.vision_processor,
config=config
)
def setup(self, stage=None):
full_dataset = LLaVADataset(config=self.config, split="train")
total_size = len(full_dataset)
val_size = int(total_size * self.val_size)
train_size = total_size - val_size
self.train_dataset, self.val_dataset = random_split(
full_dataset, [train_size, val_size],
generator=torch.Generator().manual_seed(42)
)
def train_dataloader(self):
return DataLoader(
self.train_dataset,
batch_size=self.batch_size,
shuffle=True,
num_workers=self.num_workers,
pin_memory=True,
persistent_workers=True,
prefetch_factor=2,
collate_fn=self.collator,
drop_last=True
)
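MultimodalCollator's internals aren't shown here, but its text-side job is the standard one: right-pad every sequence in the batch to a common length and build the matching attention mask. A minimal sketch (pad_batch is a hypothetical stand-in, not the project's actual collator):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the batch's max length."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The real collator additionally stacks the preprocessed pixel tensors and builds the label tensor with -100 at padded positions.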
05 Training Configuration with Hydra
Why Hydra?
Hardcoded hyperparameters are a recipe for chaos. Hydra lets you define all configuration in YAML files and override anything from the command line:
# Default training
uv run python train.py
# Quick test with subset
uv run python train.py data.use_subset=true data.subset_size=50000 training.max_epochs=1
# Full training with MLflow
uv run python train.py data.use_subset=false training.max_epochs=3 training.batch_size=20 logging.use_mlflow=true
The Training Script
torch.set_float32_matmul_precision('high') # Enable Tensor Cores
@hydra.main(version_base=None, config_path="configs", config_name="config")
def hydra_main(cfg):
config = OmegaConf.to_container(cfg, resolve=True)
L.seed_everything(42)
model = MultimodalGemmaLightning(config)
datamodule = LLaVADataModule(
tokenizer=model.model.tokenizer,
vision_processor=model.model.vision_processor,
config=config
)
callbacks = [
RichProgressBar(),
ModelCheckpoint(monitor="val/loss", save_top_k=2, save_last=True),
EarlyStopping(monitor="val/loss", patience=3),
LearningRateMonitor(logging_interval="step"),
]
trainer = Trainer(
accelerator="auto",
devices="auto",
max_epochs=config["training"]["max_epochs"],
accumulate_grad_batches=2,
gradient_clip_val=1.0,
precision="bf16-mixed",
callbacks=callbacks,
logger=[MLFlowLogger(...), TensorBoardLogger(...)],
)
trainer.fit(model, datamodule)
trainer.save_checkpoint("final_model.ckpt")
Training Configuration
| Parameter | Value | Why |
|---|---|---|
| Dataset | LLaVA-Instruct-150K | 157,712 image-text pairs from COCO |
| Epochs | 3 | More would overfit on this dataset size |
| Batch Size | 20 (effective 40) | Largest that fits in A100 40GB with this model |
| Precision | bf16-mixed | Halves memory, ~50% faster, negligible quality loss |
| Gradient Clipping | 1.0 | Prevents exploding gradients during multimodal fusion |
| Warmup | 3% of total steps | Stabilizes early training when projector is random |
| Projector LR | 1e-3 | Learning from scratch needs higher LR |
| LoRA LR | 2e-4 | Fine-tuning needs gentler LR |
| Weight Decay | 0.01 | Standard regularization |
| LoRA Rank | 64 | Good capacity without excessive parameters |
| LoRA Alpha | 128 | Effective scaling = 2.0 |
06 Training Results — What 9 Hours on an A100 Gets You
Training Metrics
| Epoch | Train Loss | Val Loss | Notes |
|---|---|---|---|
| 1 | 1.892 | 1.598 | Projector learning fast, model starts grounding |
| 2 | 1.456 | 1.462 | Loss converging, responses getting coherent |
| 3 | 1.333 | 1.430 | Best val loss, slight overfitting beginning |
What the Model Learned
After 9 hours of training, the model can:
- Identify animals: “Two cats lying on couch with one on left side, other on right side...”
- Understand rooms: “Modern spacious kitchen with yellow walls, wood floors, dining table, refrigerator...”
- Recognize food: “Close-up donut in plastic bag on table between two bananas...”
- Describe activities: “Lively skate park scene with multiple skateboarders practicing tricks...”
Benchmark Results
| Benchmark | Score | Notes |
|---|---|---|
| Basic VQA | 53.8% (7/13) | Animal and room identification strong |
| POPE Hallucination | 80% accuracy | 20% hallucination rate (yes-bias typical for small models) |
GPU Cost Analysis
| GPU | VRAM | Batch Size | Training Time | Cost/hr | Total Cost |
|---|---|---|---|---|---|
| A10 | 24GB | 8-12 | ~15-18 hrs | $0.75 | ~$12.75 |
| A100 (40GB) | 40GB | 16-20 | ~9 hrs | $1.29 | ~$11.61 |
| A100 (80GB) | 80GB | 24-32 | ~6-7 hrs | $1.99 | ~$13.30 |
| H100 | 80GB | 32-48 | ~4-5 hrs | $2.49 | ~$11.20 |
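The totals in the table are just hours × hourly rate:

```python
def training_cost(hours, rate_per_hour):
    """Estimated total cost of a training run on rented GPUs."""
    return round(hours * rate_per_hour, 2)

print(training_cost(9, 1.29))    # 11.61 -> the A100 40GB row
print(training_cost(4.5, 2.49))  # roughly the ~$11.20 H100 estimate (midpoint of 4-5 hrs)
```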
07 Evaluation — Measuring What the Model Sees
Basic VQA Evaluation
def generate_answer(model, image, question, device):
image_inputs = model.vision_processor(images=image, return_tensors="pt")
pixel_values = image_inputs["pixel_values"].to(device)
prompt = f"<image>\nHuman: {question}\nAssistant:"
text_inputs = model.tokenizer(prompt, return_tensors="pt", padding=True,
truncation=True, max_length=512)
input_ids = text_inputs["input_ids"].to(device)
attention_mask = text_inputs["attention_mask"].to(device)
with torch.no_grad():
outputs = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
images=pixel_values,
max_new_tokens=30,
            do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
)
response = model.tokenizer.decode(outputs[0], skip_special_tokens=True)
if "Assistant:" in response:
response = response.split("Assistant:", 1)[1]
return response.lower().strip()
POPE Hallucination Test
The POPE (Polling-based Object Probing Evaluation) test checks if the model hallucinates objects that aren’t in the image:
def run_pope_mini(checkpoint_path, num_samples=100):
    tp = fp = tn = fn = 0  # confusion-matrix counters
POPE_TESTS = [
{
"path": "samples/test_images/sample_001.jpg",
"present": ["cat", "couch"], # should answer "yes"
"absent": ["dog", "elephant", "car"], # should answer "no"
},
]
for test in POPE_TESTS:
image = Image.open(test["path"]).convert("RGB")
for obj in test["present"]:
question = f"Is there a {obj} in this image? Answer yes or no."
answer = generate_answer(model, image, question, device)
if "yes" in answer: tp += 1
else: fn += 1
for obj in test["absent"]:
question = f"Is there a {obj} in this image? Answer yes or no."
answer = generate_answer(model, image, question, device)
if "no" in answer: tn += 1
else: fp += 1 # hallucination!
accuracy = (tp + tn) / (tp + fp + tn + fn)
hallucination_rate = fp / (fp + tn)
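With the counters defined this way, the reported numbers correspond to a confusion matrix like the following (the counts are illustrative, not the actual evaluation tallies):

```python
def pope_metrics(tp, fp, tn, fn):
    """Accuracy over all probes; hallucination = 'yes' for an absent object."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    hallucination_rate = fp / (fp + tn)
    return accuracy, hallucination_rate

# Illustrative counts that would yield 80% accuracy / 20% hallucination
acc, halluc = pope_metrics(tp=40, fp=10, tn=40, fn=10)
print(acc, halluc)  # 0.8 0.2
```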
08 MLOps — The Full CI/CD Pipeline
GitHub Actions Workflow
Every push to main triggers an automated pipeline:
Push to main
│
▾
Tests (pytest + ruff linting)
│
▾
Download checkpoint from HuggingFace Hub
│
▾
Export model for deployment
│
▾
Deploy to HuggingFace Spaces (Gradio app)
name: MLOps Pipeline - Test, Trace & Deploy
on:
push:
branches: [main]
paths: ['src/**', 'configs/**', 'hf_space/**']
workflow_dispatch:
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install UV
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Lint with ruff
run: ruff check src/ --ignore E501,F401
- name: Run tests
run: pytest tests/ -v --tb=short
trace-and-deploy:
needs: test
if: github.ref == 'refs/heads/main'
steps:
- name: Login to HuggingFace
run: python -c "import os; from huggingface_hub import login; login(token=os.environ['HF_TOKEN'])"
- name: Download checkpoint
run: python scripts/download_checkpoint.py
- name: Export model
run: python src/trace_model.py --ckpt_path final_model.ckpt --output_path hf_space/model.pt
- name: Deploy to HuggingFace Spaces
run: python scripts/deploy_to_hf.py
Tools Used
| Tool | Purpose | Why |
|---|---|---|
| PyTorch Lightning | Training framework | Clean code, automatic optimization, multi-GPU |
| Hydra | Configuration | YAML configs, command-line overrides |
| MLflow | Experiment tracking | Loss curves, hyperparameters, model versioning |
| DVC | Data versioning | Track 157K training samples across experiments |
| GitHub Actions | CI/CD | Automated testing and deployment on every push |
| HuggingFace Spaces | Deployment | Free GPU inference, Gradio UI, public demo |
| Docker | Containerization | Reproducible environments |
| Weights & Biases | (Optional) logging | Real-time training visualization |
09 Deployment — From Checkpoint to Live Demo
The Gradio App
def predict_with_image(image, question, max_tokens=100, temperature=0.7):
if not isinstance(image, Image.Image):
image = Image.fromarray(image).convert('RGB')
vision_inputs = vision_processor(images=[image], return_tensors="pt")
pixel_values = vision_inputs["pixel_values"].to(device)
prompt = f"<image>\nHuman: {question}\nAssistant:"
text_inputs = tokenizer(prompt, return_tensors="pt", padding=True,
truncation=True, max_length=256)
input_ids = text_inputs["input_ids"].to(device)
attention_mask = text_inputs["attention_mask"].to(device)
with torch.no_grad():
outputs = model.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
images=pixel_values,
max_new_tokens=min(max_tokens, 150),
temperature=min(max(temperature, 0.1), 2.0),
do_sample=temperature > 0.1,
repetition_penalty=1.1
)
generated_tokens = outputs[0][input_ids.shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
return response.strip()
Live Demo
The model is deployed at huggingface.co/spaces/sagar007/Multimodal-Gemma. Upload any image, ask a question, and see the model respond in real-time.
10 Project Structure — How It All Fits Together
multimodal-gemma-270m/
├── .github/workflows/train_deploy.yml # CI/CD pipeline
├── configs/
│ ├── config.yaml # Main Hydra config
│ ├── model_config.yaml # Architecture settings
│ ├── training_config.yaml # Hyperparameters
│ └── data_config.yaml # Dataset config
├── src/
│ ├── models/
│ │ ├── lightning_module.py # PyTorch Lightning trainer
│ │ ├── multimodal_gemma.py # Core model architecture
│ │ └── projectors.py # Vision projector MLP
│ ├── data/datamodule.py # Lightning DataModule
│ └── utils/config.py # Configuration utilities
├── hf_space/
│ ├── app.py # Gradio application
│ └── requirements.txt # Space dependencies
├── samples/test_images/ # Evaluation images
├── train.py # Training entry point
├── inference.py # Inference script
├── evaluate.py # Benchmark evaluation
├── gradio_app.py # Local web interface
├── dvc.yaml # DVC pipeline
├── Dockerfile # Container image
├── Makefile # Convenience commands
└── pyproject.toml # Project metadata
The separation between src/models/, src/data/, and configs/ is intentional. You should be able to swap out the language model (e.g., replace Gemma with Phi or Qwen), change the dataset (e.g., switch to ShareGPT), or modify training hyperparameters — all without touching the core model code.
Summary: What I Learned Building This
- 1. Multimodal fusion is simpler than you think — A 2-layer MLP projector is all you need to bridge vision and language. The pretrained models do the heavy lifting.
- 2. LoRA makes everything practical — Training only 3.4% of parameters (18.6M out of 539M) means you can fine-tune on a single GPU in hours, not days.
- 3. The merge function is everything — How you interleave image and text tokens determines what the model can learn. Getting the attention masks and label masking right is critical.
- 4. Small models teach big lessons — A 270M model won’t win benchmarks, but it will teach you every architectural decision that matters. Scale is a knob you turn later.
- 5. MLOps completes the loop — A model that only runs in your notebook isn’t useful. CI/CD, experiment tracking, and one-click deployment turn a research project into a product.
- 6. Open source everything — The full code, model weights, and training logs are available. If one person learns from this, it was worth the 9 hours of A100 time.
Building a multimodal model from scratch isn't about competing with GPT-4V — it's about understanding every decision from pixels to predictions. The code is at github.com/sagar431/multimodal-gemma-270m. Clone it, train it, break it, fix it.