Building a Multimodal Vision-Language Model from Scratch
Gemma-270M + CLIP Vision = A tiny model that sees and speaks
Why This Blog Exists
Most tutorials on vision-language models start with “pip install transformers” and end with calling a pretrained model. That’s fine for applications — but it teaches you nothing about how these systems actually work.
This blog documents how I built a complete multimodal vision-language model from scratch — combining Google’s Gemma-270M language model with OpenAI’s CLIP vision encoder, training it on 157K image-text pairs from the LLaVA dataset, and deploying it to HuggingFace Spaces with a full CI/CD pipeline. The entire project is open-source at github.com/sagar431/multimodal-gemma-270m.
The key insight: you don’t need a 70B parameter model to understand vision-language architectures. A 270M parameter model trained on a single A100 for 9 hours can teach you everything about how images and text are fused, how projector layers bridge modalities, and why LoRA makes fine-tuning practical.
01 The Architecture — How Images and Text Become One
The Core Idea
A multimodal model needs to solve one fundamental problem: images live in pixel space (continuous, spatial, high-dimensional), while text lives in token space (discrete, sequential, vocabulary-indexed). How do you get them to talk to each other?
The answer, pioneered by LLaVA (Liu et al., 2023), is elegantly simple:
- Use a pretrained vision encoder (CLIP) to convert images into a sequence of feature vectors
- Use a projection layer (MLP) to map those vision features into the language model’s embedding space
- Interleave the projected image tokens with text tokens
- Let the language model process everything as if it were all text
Image (224×224×3)
│
▾
CLIP ViT-Large/14 (frozen, 428M params)
│
▾
Patch Tokens: [batch, 256, 1024]
│
▾
Vision Projector MLP (trainable)
│
▾
Projected Tokens: [batch, 256, 1536]
│
├—— merged with ——▸ Text Tokens: [batch, seq_len, 1536]
│
▾
Gemma-270M Language Model (LoRA adapters, trainable)
│
▾
Generated Text Response
Why These Specific Models?
| Component | Model | Parameters | Why This One |
|---|---|---|---|
| Vision Encoder | CLIP ViT-Large/14 | 428M (frozen) | Best open vision encoder, trained on 400M image-text pairs |
| Language Model | Gemma-270M | 270M (LoRA) | Small enough for single-GPU, powerful enough to generate coherent text |
| Projector | 2-layer MLP | ~6M (trainable) | Simple, proven in LLaVA, fast to train |
| Total | — | 539M | Only 18.6M trainable (3.4%) |
The Vision Projector — Bridging Two Worlds
The projector is deceptively important. It’s “just” a 2-layer MLP, but it’s doing the critical job of translating CLIP’s visual language into Gemma’s text language.
class VisionProjector(nn.Module):
"""Projects vision features to language model embedding space.
Uses a 2-layer MLP with GELU activation following the LLaVA architecture."""
def __init__(self, vision_dim, language_dim, hidden_dim=None, dropout=0.0):
super().__init__()
hidden_dim = hidden_dim or (language_dim * 2)
self.fc1 = nn.Linear(vision_dim, hidden_dim)
self.act = nn.GELU()
self.fc2 = nn.Linear(hidden_dim, language_dim)
self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
self._init_weights()
def _init_weights(self):
for module in [self.fc1, self.fc2]:
if isinstance(module, nn.Linear):
nn.init.xavier_uniform_(module.weight)
if module.bias is not None:
nn.init.zeros_(module.bias)
def forward(self, vision_features):
x = self.fc1(vision_features)
x = self.act(x)
x = self.dropout(x)
x = self.fc2(x)
return x
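To make the sizes concrete, here is the projector's parameter count as plain arithmetic. This is a sketch using the defaults shown above (hidden_dim = 2 × language_dim); the trained run may have used a smaller hidden_dim, which would explain the ~6M figure in the table:

```python
def projector_param_count(vision_dim, language_dim, hidden_dim=None):
    """Weights + biases of the 2-layer MLP projector."""
    hidden_dim = hidden_dim or language_dim * 2
    fc1 = vision_dim * hidden_dim + hidden_dim      # Linear(vision_dim -> hidden_dim)
    fc2 = hidden_dim * language_dim + language_dim  # Linear(hidden_dim -> language_dim)
    return fc1 + fc2

# CLIP ViT-L/14 features (1024) -> Gemma embedding space (1536)
print(projector_param_count(1024, 1536))  # 7868928 with the default hidden_dim of 3072
```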
02 Building the Multimodal Model
The MultimodalGemma Class
This is the core of the entire system. It wires together the vision encoder, projector, and language model into a single differentiable pipeline.
class MultimodalGemma(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self._setup_tokenizer()
self._setup_language_model()
self._setup_vision_components()
self._setup_projectors()
self._freeze_encoders()
self._setup_lora()
Each setup method handles one component. Let me walk through the critical ones:
Setting Up the Vision Encoder
def _setup_vision_components(self):
vision_model_name = self.config["model"]["vision_model_name"]
self.vision_encoder = CLIPVisionModel.from_pretrained(
vision_model_name,
torch_dtype=torch.bfloat16
)
self.vision_processor = CLIPProcessor.from_pretrained(vision_model_name)
CLIP ViT-Large/14 splits a 224×224 image into a 16×16 grid of patches (14×14 pixels each), producing 256 patch tokens + 1 CLS token. We use bfloat16 to cut memory in half without meaningful precision loss.
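The patch arithmetic is worth checking once by hand:

```python
image_size, patch_size = 224, 14       # CLIP ViT-Large/14 input and patch size
grid = image_size // patch_size        # 16 patches per side
num_patches = grid * grid              # 256 patch tokens
num_tokens = num_patches + 1           # plus the CLS token the ViT prepends
print(grid, num_patches, num_tokens)   # 16 256 257
```

We drop the CLS token later in `encode_images`, which is why 256 (not 257) vision tokens reach the language model.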
LoRA — Making Fine-Tuning Practical
Full fine-tuning of even a 270M model would update all 270M parameters. With LoRA (Low-Rank Adaptation), we inject small trainable matrices into the attention layers:
def _setup_lora(self):
lora_config = LoraConfig(
r=64, # rank of the low-rank matrices
lora_alpha=128, # scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.1,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
self.language_model = get_peft_model(self.language_model, lora_config)
self.language_model.print_trainable_parameters()
# Output: trainable params: 12,582,912 || all params: 270,000,000 || trainable%: 4.66%
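The trainable-parameter count follows directly from the LoRA construction: each adapted weight matrix W (d_out × d_in) gets two low-rank factors, A (r × d_in) and B (d_out × r), adding r × (d_in + d_out) parameters. A quick sketch (the 640-dim projection here is illustrative, not Gemma-270M's actual attention shapes):

```python
def lora_params_per_matrix(d_in, d_out, r):
    """Extra trainable params from LoRA-adapting one linear layer."""
    return r * (d_in + d_out)  # A: (r, d_in) plus B: (d_out, r)

# Illustrative: a square 640-dim projection at rank 64
print(lora_params_per_matrix(640, 640, r=64))  # 81920 per adapted matrix

# The LoRA update is scaled by alpha / r
scaling = 128 / 64
print(scaling)  # 2.0
```

Summed over q/k/v/o projections in every layer, this is how you land in the low tens of millions of trainable parameters instead of 270M.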
The Image-Text Merge — The Heart of Multimodality
This is the most subtle and important function in the entire codebase. When the model receives <image> What is this?, it needs to:
- Convert <image> into 256 actual vision tokens
- Expand the sequence length accordingly
- Update attention masks and labels to match
def encode_images(self, images):
with torch.no_grad():
vision_outputs = self.vision_encoder(pixel_values=images)
# Use ALL patch tokens (skip CLS at index 0)
image_features = vision_outputs.last_hidden_state[:, 1:, :]
projected_features = self.vision_projector(image_features)
return projected_features
def _merge_image_features(self, input_ids, image_features, attention_mask, labels):
batch_size, seq_len = input_ids.shape
num_image_tokens = image_features.shape[1] # 256 patches
device = input_ids.device
text_embeds = self.language_model.get_input_embeddings()(input_ids)
image_token_mask = (input_ids == self.image_token_id)
new_seq_len = seq_len - 1 + num_image_tokens # replace 1 token with 256
new_embeds = torch.zeros(batch_size, new_seq_len, text_embeds.shape[-1],
dtype=text_embeds.dtype, device=device)
new_attention_mask = torch.zeros(batch_size, new_seq_len,
dtype=attention_mask.dtype, device=device)
new_labels = torch.full((batch_size, new_seq_len), -100,
dtype=labels.dtype, device=device) if labels is not None else None
for batch_idx in range(batch_size):
image_positions = torch.where(image_token_mask[batch_idx])[0]
if len(image_positions) > 0:
img_pos = image_positions[0].item()
# Copy text before <image>
new_embeds[batch_idx, :img_pos] = text_embeds[batch_idx, :img_pos]
# Insert all 256 image tokens
new_embeds[batch_idx, img_pos:img_pos + num_image_tokens] = image_features[batch_idx]
# Copy text after <image>
remaining_len = seq_len - img_pos - 1
new_embeds[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
text_embeds[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]
# Update attention mask
new_attention_mask[batch_idx, :img_pos] = attention_mask[batch_idx, :img_pos]
new_attention_mask[batch_idx, img_pos:img_pos + num_image_tokens] = 1
new_attention_mask[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
attention_mask[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]
# Labels: mask image positions with -100 (ignore in loss)
if labels is not None:
new_labels[batch_idx, :img_pos] = labels[batch_idx, :img_pos]
new_labels[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
labels[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]
return new_embeds, new_attention_mask, new_labels
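The tensor bookkeeping above is easier to see on plain Python lists. This toy version (a hypothetical helper, with strings standing in for embeddings) reproduces the length math and the -100 label masking for a single sequence:

```python
IMAGE_TOKEN = "<image>"

def merge_image_tokens(tokens, patch_embeds):
    """Replace the single <image> token with all patch embeddings."""
    pos = tokens.index(IMAGE_TOKEN)
    merged = tokens[:pos] + patch_embeds + tokens[pos + 1:]
    # Image positions are masked with -100 so the loss ignores them;
    # text labels carry over unchanged (the real code also rebuilds masks).
    labels = tokens[:pos] + [-100] * len(patch_embeds) + tokens[pos + 1:]
    return merged, labels

tokens = ["<bos>", "<image>", "What", "is", "this", "?"]
patches = [f"patch_{i}" for i in range(256)]
merged, labels = merge_image_tokens(tokens, patches)
print(len(merged))  # 261 == seq_len - 1 + num_image_tokens
```

Same invariant as the real function: one `<image>` placeholder out, 256 vision tokens in, everything after it shifted right.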
03 The Training Pipeline with PyTorch Lightning
Why Lightning?
PyTorch Lightning removes the boilerplate from training: gradient accumulation, mixed precision, multi-GPU, logging, checkpointing — all handled by the framework. You focus on the model, not the training loop.
The Lightning Module
class MultimodalGemmaLightning(L.LightningModule):
def __init__(self, config):
super().__init__()
self.save_hyperparameters()
self.config = config
self.model = MultimodalGemma(config)
self.training_step_outputs = []
self.validation_step_outputs = []
def forward(self, batch):
return self.model(
input_ids=batch["input_ids"],
attention_mask=batch["attention_mask"],
images=batch.get("images"),
labels=batch["labels"]
)
def training_step(self, batch, batch_idx):
outputs = self(batch)
loss = outputs["loss"]
self.log("train/loss", loss, on_step=True, on_epoch=True, prog_bar=True, sync_dist=True)
self.log("train/learning_rate", self.optimizers().param_groups[0]["lr"], on_step=True)
self.training_step_outputs.append(loss.detach())
return loss
def validation_step(self, batch, batch_idx):
outputs = self(batch)
loss = outputs["loss"]
self.log("val/loss", loss, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)
self.validation_step_outputs.append(loss.detach())
return loss
Optimizer Configuration — Differential Learning Rates
Different parts of the model need different learning rates. The projector is learning from scratch, but the LoRA adapters are fine-tuning a pretrained model:
def configure_optimizers(self):
param_groups = []
# Vision projector — higher LR (learning from scratch)
vision_proj_params = list(self.model.vision_projector.parameters())
param_groups.append({
"params": vision_proj_params,
"lr": float(self.config["training"]["projector_lr"]), # 1e-3
"name": "vision_projector"
})
# LoRA adapters — lower LR (fine-tuning)
lora_params = [p for n, p in self.model.language_model.named_parameters() if p.requires_grad]
param_groups.append({
"params": lora_params,
"lr": float(self.config["training"]["lora_lr"]), # 2e-4
"name": "lora_adapters"
})
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01, eps=1e-8)
# Linear warmup then decay
        accumulate = self.config["training"].get("accumulate_grad_batches", 2)
        max_epochs = self.config["training"]["max_epochs"]
        total_steps = (len(self.trainer.datamodule.train_dataloader()) // accumulate) * max_epochs
warmup_steps = int(total_steps * 0.03)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}
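For reference, the multiplier that get_linear_schedule_with_warmup applies to each group's base LR looks like this (a pure-Python sketch of the HF schedule):

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warmup to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total, warmup = 10_000, 300  # e.g. 3% warmup
print(lr_multiplier(150, warmup, total))     # 0.5  halfway through warmup
print(lr_multiplier(300, warmup, total))     # 1.0  warmup done, full LR
print(lr_multiplier(10_000, warmup, total))  # 0.0  end of training
```

Because the multiplier is shared, the projector and LoRA groups keep their 5× ratio (1e-3 vs 2e-4) throughout training.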
04 Data — The LLaVA Dataset
Dataset Structure
We train on the LLaVA-Instruct-150K dataset: 157,712 image-text conversation pairs built from COCO images. Each sample has:
- An image (from MS-COCO)
- A multi-turn conversation (Human asks, Assistant responds)
- Visual grounding (the conversation references specific image content)
The Data Pipeline
class LLaVADataModule(L.LightningDataModule):
def __init__(self, tokenizer, vision_processor, config):
super().__init__()
self.tokenizer = tokenizer
self.vision_processor = vision_processor
        self.config = config  # setup() builds the dataset from this
        self.batch_size = config["training"]["batch_size"]
self.num_workers = config["data"].get("num_workers", 4)
self.val_size = config["data"].get("val_size", 0.02)
self.collator = MultimodalCollator(
tokenizer=self.tokenizer,
vision_processor=self.vision_processor,
config=config
)
def setup(self, stage=None):
full_dataset = LLaVADataset(config=self.config, split="train")
total_size = len(full_dataset)
val_size = int(total_size * self.val_size)
train_size = total_size - val_size
self.train_dataset, self.val_dataset = random_split(
full_dataset, [train_size, val_size],
generator=torch.Generator().manual_seed(42)
)
def train_dataloader(self):
return DataLoader(
self.train_dataset,
batch_size=self.batch_size,
shuffle=True,
num_workers=self.num_workers,
pin_memory=True,
persistent_workers=True,
prefetch_factor=2,
collate_fn=self.collator,
drop_last=True
)
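MultimodalCollator's internals aren't shown here, but its text-side job is the standard one: right-pad every sequence in the batch to a common length and build the matching attention mask. A minimal sketch (pad_batch is a hypothetical stand-in, not the project's actual collator):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the batch's max length."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The real collator additionally stacks the preprocessed pixel tensors and builds the label tensor with -100 at padded positions.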
05 Training Configuration with Hydra
Why Hydra?
Hardcoded hyperparameters are a recipe for chaos. Hydra lets you define all configuration in YAML files and override anything from the command line:
# Default training
uv run python train.py
# Quick test with subset
uv run python train.py data.use_subset=true data.subset_size=50000 training.max_epochs=1
# Full training with MLflow
uv run python train.py data.use_subset=false training.max_epochs=3 training.batch_size=20 logging.use_mlflow=true
The Training Script
torch.set_float32_matmul_precision('high') # Enable Tensor Cores
@hydra.main(version_base=None, config_path="configs", config_name="config")
def hydra_main(cfg):
config = OmegaConf.to_container(cfg, resolve=True)
L.seed_everything(42)
model = MultimodalGemmaLightning(config)
datamodule = LLaVADataModule(
tokenizer=model.model.tokenizer,
vision_processor=model.model.vision_processor,
config=config
)
callbacks = [
RichProgressBar(),
ModelCheckpoint(monitor="val/loss", save_top_k=2, save_last=True),
EarlyStopping(monitor="val/loss", patience=3),
LearningRateMonitor(logging_interval="step"),
]
trainer = Trainer(
accelerator="auto",
devices="auto",
max_epochs=config["training"]["max_epochs"],
accumulate_grad_batches=2,
gradient_clip_val=1.0,
precision="bf16-mixed",
callbacks=callbacks,
logger=[MLFlowLogger(...), TensorBoardLogger(...)],
)
trainer.fit(model, datamodule)
trainer.save_checkpoint("final_model.ckpt")
Training Configuration
| Parameter | Value | Why |
|---|---|---|
| Dataset | LLaVA-Instruct-150K | 157,712 image-text pairs from COCO |
| Epochs | 3 | More would overfit on this dataset size |
| Batch Size | 20 (effective 40) | Largest that fits in A100 40GB with this model |
| Precision | bf16-mixed | Halves memory, ~50% faster, negligible quality loss |
| Gradient Clipping | 1.0 | Prevents exploding gradients during multimodal fusion |
| Warmup | 3% of total steps | Stabilizes early training when projector is random |
| Projector LR | 1e-3 | Learning from scratch needs higher LR |
| LoRA LR | 2e-4 | Fine-tuning needs gentler LR |
| Weight Decay | 0.01 | Standard regularization |
| LoRA Rank | 64 | Good capacity without excessive parameters |
| LoRA Alpha | 128 | Effective scaling = 2.0 |
06 Training Results — What 9 Hours on an A100 Gets You
Training Metrics
| Epoch | Train Loss | Val Loss | Notes |
|---|---|---|---|
| 1 | 1.892 | 1.598 | Projector learning fast, model starts grounding |
| 2 | 1.456 | 1.462 | Loss converging, responses getting coherent |
| 3 | 1.333 | 1.430 | Best val loss, slight overfitting beginning |
What the Model Learned
After 9 hours of training, the model can:
- Identify animals: “Two cats lying on couch with one on left side, other on right side...”
- Understand rooms: “Modern spacious kitchen with yellow walls, wood floors, dining table, refrigerator...”
- Recognize food: “Close-up donut in plastic bag on table between two bananas...”
- Describe activities: “Lively skate park scene with multiple skateboarders practicing tricks...”
Benchmark Results
| Benchmark | Score | Notes |
|---|---|---|
| Basic VQA | 53.8% (7/13) | Animal and room identification strong |
| POPE Hallucination | 80% accuracy | 20% hallucination rate (yes-bias typical for small models) |
GPU Cost Analysis
| GPU | VRAM | Batch Size | Training Time | Cost/hr | Total Cost |
|---|---|---|---|---|---|
| A10 | 24GB | 8-12 | ~15-18 hrs | $0.75 | ~$12.75 |
| A100 (40GB) | 40GB | 16-20 | ~9 hrs | $1.29 | ~$11.61 |
| A100 (80GB) | 80GB | 24-32 | ~6-7 hrs | $1.99 | ~$13.30 |
| H100 | 80GB | 32-48 | ~4-5 hrs | $2.49 | ~$11.20 |
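The totals in the table are just hours × hourly rate:

```python
def training_cost(hours, rate_per_hour):
    """Estimated total cost of a training run on rented GPUs."""
    return round(hours * rate_per_hour, 2)

print(training_cost(9, 1.29))    # 11.61 -> the A100 40GB row
print(training_cost(4.5, 2.49))  # roughly the ~$11.20 H100 estimate (midpoint of 4-5 hrs)
```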
07 Evaluation — Measuring What the Model Sees
Basic VQA Evaluation
def generate_answer(model, image, question, device):
image_inputs = model.vision_processor(images=image, return_tensors="pt")
pixel_values = image_inputs["pixel_values"].to(device)
prompt = f"<image>\nHuman: {question}\nAssistant:"
text_inputs = model.tokenizer(prompt, return_tensors="pt", padding=True,
truncation=True, max_length=512)
input_ids = text_inputs["input_ids"].to(device)
attention_mask = text_inputs["attention_mask"].to(device)
with torch.no_grad():
outputs = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
images=pixel_values,
max_new_tokens=30,
            do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
)
response = model.tokenizer.decode(outputs[0], skip_special_tokens=True)
if "Assistant:" in response:
response = response.split("Assistant:", 1)[1]
return response.lower().strip()
POPE Hallucination Test
The POPE (Polling-based Object Probing Evaluation) test checks if the model hallucinates objects that aren’t in the image:
def run_pope_mini(checkpoint_path, num_samples=100):
    tp = fp = tn = fn = 0  # confusion-matrix counters
POPE_TESTS = [
{
"path": "samples/test_images/sample_001.jpg",
"present": ["cat", "couch"], # should answer "yes"
"absent": ["dog", "elephant", "car"], # should answer "no"
},
]
for test in POPE_TESTS:
image = Image.open(test["path"]).convert("RGB")
for obj in test["present"]:
question = f"Is there a {obj} in this image? Answer yes or no."
answer = generate_answer(model, image, question, device)
if "yes" in answer: tp += 1
else: fn += 1
for obj in test["absent"]:
question = f"Is there a {obj} in this image? Answer yes or no."
answer = generate_answer(model, image, question, device)
if "no" in answer: tn += 1
else: fp += 1 # hallucination!
accuracy = (tp + tn) / (tp + fp + tn + fn)
hallucination_rate = fp / (fp + tn)
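With the counters defined this way, the reported numbers correspond to a confusion matrix like the following (the counts are illustrative, not the actual evaluation tallies):

```python
def pope_metrics(tp, fp, tn, fn):
    """Accuracy over all probes; hallucination = 'yes' for an absent object."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    hallucination_rate = fp / (fp + tn)
    return accuracy, hallucination_rate

# Illustrative counts that would yield 80% accuracy / 20% hallucination
acc, halluc = pope_metrics(tp=40, fp=10, tn=40, fn=10)
print(acc, halluc)  # 0.8 0.2
```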
08 MLOps — The Full CI/CD Pipeline
GitHub Actions Workflow
Every push to main triggers an automated pipeline:
Push to main
│
▾
Tests (pytest + ruff linting)
│
▾
Download checkpoint from HuggingFace Hub
│
▾
Export model for deployment
│
▾
Deploy to HuggingFace Spaces (Gradio app)
name: MLOps Pipeline - Test, Trace & Deploy
on:
push:
branches: [main]
paths: ['src/**', 'configs/**', 'hf_space/**']
workflow_dispatch:
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install UV
run: curl -LsSf https://astral.sh/uv/install.sh | sh
- name: Lint with ruff
run: ruff check src/ --ignore E501,F401
- name: Run tests
run: pytest tests/ -v --tb=short
trace-and-deploy:
needs: test
if: github.ref == 'refs/heads/main'
steps:
- name: Login to HuggingFace
run: python -c "import os; from huggingface_hub import login; login(token=os.environ['HF_TOKEN'])"
- name: Download checkpoint
run: python scripts/download_checkpoint.py
- name: Export model
run: python src/trace_model.py --ckpt_path final_model.ckpt --output_path hf_space/model.pt
- name: Deploy to HuggingFace Spaces
run: python scripts/deploy_to_hf.py
Tools Used
| Tool | Purpose | Why |
|---|---|---|
| PyTorch Lightning | Training framework | Clean code, automatic optimization, multi-GPU |
| Hydra | Configuration | YAML configs, command-line overrides |
| MLflow | Experiment tracking | Loss curves, hyperparameters, model versioning |
| DVC | Data versioning | Track 157K training samples across experiments |
| GitHub Actions | CI/CD | Automated testing and deployment on every push |
| HuggingFace Spaces | Deployment | Free GPU inference, Gradio UI, public demo |
| Docker | Containerization | Reproducible environments |
| Weights & Biases | (Optional) logging | Real-time training visualization |
09 Deployment — From Checkpoint to Live Demo
The Gradio App
def predict_with_image(image, question, max_tokens=100, temperature=0.7):
if not isinstance(image, Image.Image):
image = Image.fromarray(image).convert('RGB')
vision_inputs = vision_processor(images=[image], return_tensors="pt")
pixel_values = vision_inputs["pixel_values"].to(device)
prompt = f"<image>\nHuman: {question}\nAssistant:"
text_inputs = tokenizer(prompt, return_tensors="pt", padding=True,
truncation=True, max_length=256)
input_ids = text_inputs["input_ids"].to(device)
attention_mask = text_inputs["attention_mask"].to(device)
with torch.no_grad():
outputs = model.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
images=pixel_values,
max_new_tokens=min(max_tokens, 150),
temperature=min(max(temperature, 0.1), 2.0),
do_sample=temperature > 0.1,
repetition_penalty=1.1
)
generated_tokens = outputs[0][input_ids.shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
return response.strip()
Live Demo
The model is deployed at huggingface.co/spaces/sagar007/Multimodal-Gemma. Upload any image, ask a question, and see the model respond in real-time.
10 Project Structure — How It All Fits Together
multimodal-gemma-270m/
├── .github/workflows/train_deploy.yml # CI/CD pipeline
├── configs/
│ ├── config.yaml # Main Hydra config
│ ├── model_config.yaml # Architecture settings
│ ├── training_config.yaml # Hyperparameters
│ └── data_config.yaml # Dataset config
├── src/
│ ├── models/
│ │ ├── lightning_module.py # PyTorch Lightning trainer
│ │ ├── multimodal_gemma.py # Core model architecture
│ │ └── projectors.py # Vision projector MLP
│ ├── data/datamodule.py # Lightning DataModule
│ └── utils/config.py # Configuration utilities
├── hf_space/
│ ├── app.py # Gradio application
│ └── requirements.txt # Space dependencies
├── samples/test_images/ # Evaluation images
├── train.py # Training entry point
├── inference.py # Inference script
├── evaluate.py # Benchmark evaluation
├── gradio_app.py # Local web interface
├── dvc.yaml # DVC pipeline
├── Dockerfile # Container image
├── Makefile # Convenience commands
└── pyproject.toml # Project metadata
The separation between src/models/, src/data/, and configs/ is intentional. You should be able to swap out the language model (e.g., replace Gemma with Phi or Qwen), change the dataset (e.g., switch to ShareGPT), or modify training hyperparameters — all without touching the core model code.
Summary: What I Learned Building This
- 1. Multimodal fusion is simpler than you think — A 2-layer MLP projector is all you need to bridge vision and language. The pretrained models do the heavy lifting.
- 2. LoRA makes everything practical — Training only 3.4% of parameters (18.6M out of 539M) means you can fine-tune on a single GPU in hours, not days.
- 3. The merge function is everything — How you interleave image and text tokens determines what the model can learn. Getting the attention masks and label masking right is critical.
- 4. Small models teach big lessons — A 270M model won’t win benchmarks, but it will teach you every architectural decision that matters. Scale is a knob you turn later.
- 5. MLOps completes the loop — A model that only runs in your notebook isn’t useful. CI/CD, experiment tracking, and one-click deployment turn a research project into a product.
- 6. Open source everything — The full code, model weights, and training logs are available. If one person learns from this, it was worth the 9 hours of A100 time.
Building a multimodal model from scratch isn't about competing with GPT-4V — it's about understanding every decision from pixels to predictions. The code is at github.com/sagar431/multimodal-gemma-270m. Clone it, train it, break it, fix it.