Multimodal AI · Feb 2026 · 35 min read

Building a Multimodal Vision-Language Model from Scratch

Gemma-270M + CLIP Vision = A tiny model that sees and speaks

Why This Blog Exists

Most tutorials on vision-language models start with “pip install transformers” and end with calling a pretrained model. That’s fine for applications — but it teaches you nothing about how these systems actually work.

This blog documents how I built a complete multimodal vision-language model from scratch — combining Google’s Gemma-270M language model with OpenAI’s CLIP vision encoder, training it on 157K image-text pairs from the LLaVA dataset, and deploying it to HuggingFace Spaces with a full CI/CD pipeline. The entire project is open-source at github.com/sagar431/multimodal-gemma-270m.

The key insight: you don’t need a 70B parameter model to understand vision-language architectures. A 270M parameter model trained on a single A100 for 9 hours can teach you everything about how images and text are fused, how projector layers bridge modalities, and why LoRA makes fine-tuning practical.

01 The Architecture — How Images and Text Become One

The Core Idea

A multimodal model needs to solve one fundamental problem: images live in pixel space (continuous, spatial, high-dimensional), while text lives in token space (discrete, sequential, vocabulary-indexed). How do you get them to talk to each other?

The answer, pioneered by LLaVA (Liu et al., 2023), is elegantly simple:

  • Use a pretrained vision encoder (CLIP) to convert images into a sequence of feature vectors
  • Use a projection layer (MLP) to map those vision features into the language model’s embedding space
  • Interleave the projected image tokens with text tokens
  • Let the language model process everything as if it were all text
Image (224×224×3)
    │
    ▾
CLIP ViT-Large/14 (frozen, 428M params)
    │
    ▾
Patch Tokens: [batch, 256, 1024]
    │
    ▾
Vision Projector MLP (trainable)
    │
    ▾
Projected Tokens: [batch, 256, 1536]
    │
    ├—— merged with ——▸ Text Tokens: [batch, seq_len, 1536]
    │
    ▾
Gemma-270M Language Model (LoRA adapters, trainable)
    │
    ▾
Generated Text Response

Why These Specific Models?

| Component | Model | Parameters | Why This One |
|---|---|---|---|
| Vision Encoder | CLIP ViT-Large/14 | 428M (frozen) | Best open vision encoder, trained on 400M image-text pairs |
| Language Model | Gemma-270M | 270M (LoRA) | Small enough for single-GPU, powerful enough to generate coherent text |
| Projector | 2-layer MLP | ~6M (trainable) | Simple, proven in LLaVA, fast to train |
| Total | | 539M | Only 18.6M trainable (3.4%) |

The key architectural decision is keeping the vision encoder frozen. CLIP already knows how to extract meaningful visual features from 400M image-text pairs. We don’t want to mess that up. We only train the projector (to bridge the modality gap) and LoRA adapters (to teach Gemma how to talk about what it sees).
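
The constructor below calls a `_freeze_encoders()` method whose body isn't shown in the post; a minimal sketch of what such a method likely does (the helper name and the stand-in encoder here are illustrative, not the project's actual code):

```python
import torch.nn as nn

def freeze_module(module: nn.Module) -> int:
    """Disable gradients for every parameter; returns how many tensors were frozen."""
    count = 0
    for param in module.parameters():
        param.requires_grad = False
        count += 1
    return count

# Stand-in for the CLIP vision encoder: two Linear layers = 4 parameter tensors
encoder = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))
frozen = freeze_module(encoder)
print(frozen)  # 4 (weight + bias for each Linear)
```

Frozen parameters still participate in the forward pass (their activations flow into the projector), but the optimizer never updates them, which is exactly what preserves CLIP's pretrained features.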

The Vision Projector — Bridging Two Worlds

The projector is deceptively simple. It’s “just” a 2-layer MLP, but it’s doing the critical job of translating CLIP’s visual language into Gemma’s text language.

Python projectors.py
class VisionProjector(nn.Module):
    """Projects vision features to language model embedding space.
    Uses a 2-layer MLP with GELU activation following the LLaVA architecture."""

    def __init__(self, vision_dim, language_dim, hidden_dim=None, dropout=0.0):
        super().__init__()
        hidden_dim = hidden_dim or (language_dim * 2)

        self.fc1 = nn.Linear(vision_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, language_dim)
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()

        self._init_weights()

    def _init_weights(self):
        for module in [self.fc1, self.fc2]:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)

    def forward(self, vision_features):
        x = self.fc1(vision_features)
        x = self.act(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x
Why Xavier initialization? Because the projector sits between two pretrained models. If its initial outputs are too large or too small, the language model will ignore the image tokens or produce garbage. Xavier keeps the variance stable across the layers.
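
Xavier uniform draws weights from U(−a, a) with a = √(6 / (fan_in + fan_out)), giving a weight variance of 2 / (fan_in + fan_out). A quick sketch with the projector's fc1 dimensions (1024 → 3072, since hidden_dim = language_dim × 2 = 1536 × 2):

```python
import math

def xavier_uniform_bound(fan_in: int, fan_out: int) -> float:
    """Bound a of U(-a, a) used by Xavier/Glorot uniform init."""
    return math.sqrt(6.0 / (fan_in + fan_out))

# fc1 of the projector: vision_dim=1024 -> hidden_dim=3072
a = xavier_uniform_bound(1024, 3072)
# Variance of U(-a, a) is a**2 / 3 = 2 / (fan_in + fan_out),
# which keeps forward and backward signal magnitudes stable.
variance = a ** 2 / 3
print(round(variance, 6))
```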

02 Building the Multimodal Model

The MultimodalGemma Class

This is the core of the entire system. It wires together the vision encoder, projector, and language model into a single differentiable pipeline.

Python multimodal_gemma.py
class MultimodalGemma(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self._setup_tokenizer()
        self._setup_language_model()
        self._setup_vision_components()
        self._setup_projectors()
        self._freeze_encoders()
        self._setup_lora()

Each setup method handles one component. Let me walk through the critical ones:

Setting Up the Vision Encoder

Python multimodal_gemma.py
def _setup_vision_components(self):
    vision_model_name = self.config["model"]["vision_model_name"]

    self.vision_encoder = CLIPVisionModel.from_pretrained(
        vision_model_name,
        torch_dtype=torch.bfloat16
    )
    self.vision_processor = CLIPProcessor.from_pretrained(vision_model_name)

CLIP ViT-Large/14 splits a 224×224 image into a 16×16 grid of patches (14×14 pixels each), producing 256 patch tokens + 1 CLS token. We use bfloat16 to cut memory in half without meaningful precision loss.
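
The patch arithmetic is easy to verify:

```python
image_size, patch_size = 224, 14   # CLIP ViT-Large/14 defaults

patches_per_side = image_size // patch_size   # 224 / 14 = 16
num_patches = patches_per_side ** 2           # 16 * 16 = 256 patch tokens
seq_len_with_cls = num_patches + 1            # +1 CLS token
print(num_patches, seq_len_with_cls)          # 256 257
```

This is where the `[batch, 256, 1024]` shape in the architecture diagram comes from: 256 patches, each a 1024-dim CLIP feature.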

LoRA — Making Fine-Tuning Practical

Full fine-tuning of even a 270M model would update all 270M parameters. With LoRA (Low-Rank Adaptation), we inject small trainable matrices into the attention layers:

Python multimodal_gemma.py
def _setup_lora(self):
    lora_config = LoraConfig(
        r=64,                    # rank of the low-rank matrices
        lora_alpha=128,          # scaling factor
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )

    self.language_model = get_peft_model(self.language_model, lora_config)
    self.language_model.print_trainable_parameters()
    # Output: trainable params: 12,582,912 || all params: 270,000,000 || trainable%: 4.66%
LoRA rank r=64 with alpha=128 means we’re using an effective scaling of alpha/r = 2.0. Higher rank = more capacity but more parameters. For a 270M model, r=64 provides enough capacity to learn visual grounding without overfitting. We target all four attention projection matrices (Q, K, V, O) for maximum expressiveness.
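
The parameter cost of a LoRA adapter is easy to derive: each adapted matrix gains B (d_out × r) and A (r × d_in) next to the frozen weight. A sketch of that arithmetic (the hidden size d=640 is an assumption for illustration; the post doesn't state Gemma-270M's exact dimensions):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Extra parameters added by one LoRA adapter: B (d_out x r) + A (r x d_in)."""
    return r * d_in + r * d_out

r, alpha = 64, 128
scaling = alpha / r    # effective scale applied to the B @ A update: 2.0
d = 640                # hypothetical hidden size, for illustration only

per_matrix = lora_params(d, d, r)   # extra params per square projection matrix
print(scaling, per_matrix)          # 2.0 81920
```

Multiply `per_matrix` by the four targeted projections and the number of layers and you get the ~12.6M trainable parameters PEFT reports above.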

The Image-Text Merge — The Heart of Multimodality

This is the most subtle and important function in the entire codebase. When the model receives <image> What is this?, it needs to:

  • Convert <image> into 256 actual vision tokens
  • Expand the sequence length accordingly
  • Update attention masks and labels to match
Python multimodal_gemma.py
def encode_images(self, images):
    with torch.no_grad():
        vision_outputs = self.vision_encoder(pixel_values=images)
        # Use ALL patch tokens (skip CLS at index 0)
        image_features = vision_outputs.last_hidden_state[:, 1:, :]

    projected_features = self.vision_projector(image_features)
    return projected_features


def _merge_image_features(self, input_ids, image_features, attention_mask, labels):
    batch_size, seq_len = input_ids.shape
    num_image_tokens = image_features.shape[1]  # 256 patches
    device = input_ids.device

    text_embeds = self.language_model.get_input_embeddings()(input_ids)
    image_token_mask = (input_ids == self.image_token_id)

    new_seq_len = seq_len - 1 + num_image_tokens  # replace 1 token with 256

    new_embeds = torch.zeros(batch_size, new_seq_len, text_embeds.shape[-1],
                              dtype=text_embeds.dtype, device=device)
    new_attention_mask = torch.zeros(batch_size, new_seq_len,
                                      dtype=attention_mask.dtype, device=device)
    new_labels = torch.full((batch_size, new_seq_len), -100,
                             dtype=labels.dtype, device=device) if labels is not None else None

    for batch_idx in range(batch_size):
        image_positions = torch.where(image_token_mask[batch_idx])[0]

        if len(image_positions) > 0:
            img_pos = image_positions[0].item()

            # Copy text before <image>
            new_embeds[batch_idx, :img_pos] = text_embeds[batch_idx, :img_pos]
            # Insert all 256 image tokens
            new_embeds[batch_idx, img_pos:img_pos + num_image_tokens] = image_features[batch_idx]
            # Copy text after <image>
            remaining_len = seq_len - img_pos - 1
            new_embeds[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
                text_embeds[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]

            # Update attention mask
            new_attention_mask[batch_idx, :img_pos] = attention_mask[batch_idx, :img_pos]
            new_attention_mask[batch_idx, img_pos:img_pos + num_image_tokens] = 1
            new_attention_mask[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
                attention_mask[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]

            # Labels: mask image positions with -100 (ignore in loss)
            if labels is not None:
                new_labels[batch_idx, :img_pos] = labels[batch_idx, :img_pos]
                new_labels[batch_idx, img_pos + num_image_tokens:img_pos + num_image_tokens + remaining_len] = \
                    labels[batch_idx, img_pos + 1:img_pos + 1 + remaining_len]

    return new_embeds, new_attention_mask, new_labels
Why label image tokens as -100? Because we don’t want the model to “predict” image content — that’s the vision encoder’s job. The language model should only be penalized for generating wrong text tokens. Setting labels to -100 tells PyTorch’s CrossEntropyLoss to ignore those positions entirely.
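
A tiny numeric illustration of the `-100` semantics: the loss is averaged only over positions whose label is not the ignore index, mirroring `CrossEntropyLoss(ignore_index=-100)` (the probabilities below are made up for the example):

```python
import math

def masked_nll(log_probs, labels, ignore_index=-100):
    """Mean negative log-likelihood over non-ignored positions only."""
    terms = [-lp[y] for lp, y in zip(log_probs, labels) if y != ignore_index]
    return sum(terms) / len(terms)

# 4 positions over a vocab of 3; positions 1 and 2 stand in for image tokens
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],   # ignored (image token)
    [math.log(0.3), math.log(0.3), math.log(0.4)],   # ignored (image token)
    [math.log(0.6), math.log(0.3), math.log(0.1)],
]
labels = [0, -100, -100, 1]
loss = masked_nll(log_probs, labels)
# Only positions 0 and 3 contribute: (-log 0.7 - log 0.3) / 2
```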

03 The Training Pipeline with PyTorch Lightning

Why Lightning?

PyTorch Lightning removes the boilerplate from training: gradient accumulation, mixed precision, multi-GPU, logging, checkpointing — all handled by the framework. You focus on the model, not the training loop.

The Lightning Module

Python lightning_module.py
class MultimodalGemmaLightning(L.LightningModule):
    def __init__(self, config):
        super().__init__()
        self.save_hyperparameters()
        self.config = config
        self.model = MultimodalGemma(config)
        self.training_step_outputs = []
        self.validation_step_outputs = []

    def forward(self, batch):
        return self.model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            images=batch.get("images"),
            labels=batch["labels"]
        )

    def training_step(self, batch, batch_idx):
        outputs = self(batch)
        loss = outputs["loss"]
        self.log("train/loss", loss, on_step=True, on_epoch=True, prog_bar=True, sync_dist=True)
        self.log("train/learning_rate", self.optimizers().param_groups[0]["lr"], on_step=True)
        self.training_step_outputs.append(loss.detach())
        return loss

    def validation_step(self, batch, batch_idx):
        outputs = self(batch)
        loss = outputs["loss"]
        self.log("val/loss", loss, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)
        self.validation_step_outputs.append(loss.detach())
        return loss

Optimizer Configuration — Differential Learning Rates

Different parts of the model need different learning rates. The projector is learning from scratch, but the LoRA adapters are fine-tuning a pretrained model:

Python lightning_module.py
def configure_optimizers(self):
    param_groups = []

    # Vision projector — higher LR (learning from scratch)
    vision_proj_params = list(self.model.vision_projector.parameters())
    param_groups.append({
        "params": vision_proj_params,
        "lr": float(self.config["training"]["projector_lr"]),  # 1e-3
        "name": "vision_projector"
    })

    # LoRA adapters — lower LR (fine-tuning)
    lora_params = [p for n, p in self.model.language_model.named_parameters() if p.requires_grad]
    param_groups.append({
        "params": lora_params,
        "lr": float(self.config["training"]["lora_lr"]),  # 2e-4
        "name": "lora_adapters"
    })

    optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01, eps=1e-8)

    # Linear warmup then decay
    # Linear warmup then decay (values read from the training config)
    accumulate = self.config["training"].get("accumulate_grad_batches", 2)
    max_epochs = self.config["training"]["max_epochs"]
    total_steps = (len(self.trainer.datamodule.train_dataloader()) // accumulate) * max_epochs
    warmup_steps = int(total_steps * 0.03)

    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

    return {"optimizer": optimizer, "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}
Differential learning rates are critical. The projector starts with random weights so it needs a higher LR (1e-3) to learn quickly. The LoRA adapters are modifying a pretrained model, so they need a gentle LR (2e-4) to avoid catastrophic forgetting. If you use the same LR for both, either the projector learns too slowly or the LoRA adapters forget too much.
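
The schedule itself is simple enough to write as a pure function; this sketch reproduces the multiplier that `get_linear_schedule_with_warmup` applies to each group's base LR:

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """LR multiplier: ramp 0 -> 1 over warmup, then decay linearly back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total, warmup = 10_000, 300   # warmup = 3% of total steps
assert linear_warmup_decay(0, warmup, total) == 0.0       # start of training
assert linear_warmup_decay(warmup, warmup, total) == 1.0  # peak LR
assert linear_warmup_decay(total, warmup, total) == 0.0   # end of training
```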

04 Data — The LLaVA Dataset

Dataset Structure

We train on the LLaVA-Instruct-150K dataset: 157,712 image-text conversation pairs built from COCO images. Each sample has:

  • An image (from MS-COCO)
  • A multi-turn conversation (Human asks, Assistant responds)
  • Visual grounding (the conversation references specific image content)
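
A representative sample in the LLaVA-Instruct format (paraphrased for illustration, not an actual dataset record):

```python
sample = {
    "id": "000000123456",                        # illustrative id
    "image": "COCO_train2014_000000123456.jpg",  # illustrative COCO filename
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the cat doing?"},
        {"from": "gpt",   "value": "The cat is lying on the couch."},
    ],
}

# The <image> placeholder in the first human turn is the single token that
# _merge_image_features later expands into 256 vision tokens.
```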

The Data Pipeline

Python datamodule.py
class LLaVADataModule(L.LightningDataModule):
    def __init__(self, tokenizer, vision_processor, config):
        super().__init__()
        self.config = config  # stored for use in setup()
        self.tokenizer = tokenizer
        self.vision_processor = vision_processor
        self.batch_size = config["training"]["batch_size"]
        self.num_workers = config["data"].get("num_workers", 4)
        self.val_size = config["data"].get("val_size", 0.02)

        self.collator = MultimodalCollator(
            tokenizer=self.tokenizer,
            vision_processor=self.vision_processor,
            config=config
        )

    def setup(self, stage=None):
        full_dataset = LLaVADataset(config=self.config, split="train")
        total_size = len(full_dataset)
        val_size = int(total_size * self.val_size)
        train_size = total_size - val_size

        self.train_dataset, self.val_dataset = random_split(
            full_dataset, [train_size, val_size],
            generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            pin_memory=True,
            persistent_workers=True,
            prefetch_factor=2,
            collate_fn=self.collator,
            drop_last=True
        )
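
The `MultimodalCollator` referenced above isn't shown in the post; a minimal sketch of the padding logic such a collator needs (the class is real in the repo, but this body is an assumption):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The real collator additionally runs `vision_processor` on the images and stacks the resulting pixel tensors alongside `input_ids`, `attention_mask`, and `labels`.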

05 Training Configuration with Hydra

Why Hydra?

Hardcoded hyperparameters are a recipe for chaos. Hydra lets you define all configuration in YAML files and override anything from the command line:

Bash terminal
# Default training
uv run python train.py

# Quick test with subset
uv run python train.py data.use_subset=true data.subset_size=50000 training.max_epochs=1

# Full training with MLflow
uv run python train.py data.use_subset=false training.max_epochs=3 training.batch_size=20 logging.use_mlflow=true

The Training Script

Python train.py
torch.set_float32_matmul_precision('high')  # Enable Tensor Cores

@hydra.main(version_base=None, config_path="configs", config_name="config")
def hydra_main(cfg):
    config = OmegaConf.to_container(cfg, resolve=True)

    L.seed_everything(42)

    model = MultimodalGemmaLightning(config)
    datamodule = LLaVADataModule(
        tokenizer=model.model.tokenizer,
        vision_processor=model.model.vision_processor,
        config=config
    )

    callbacks = [
        RichProgressBar(),
        ModelCheckpoint(monitor="val/loss", save_top_k=2, save_last=True),
        EarlyStopping(monitor="val/loss", patience=3),
        LearningRateMonitor(logging_interval="step"),
    ]

    trainer = Trainer(
        accelerator="auto",
        devices="auto",
        max_epochs=config["training"]["max_epochs"],
        accumulate_grad_batches=2,
        gradient_clip_val=1.0,
        precision="bf16-mixed",
        callbacks=callbacks,
        logger=[MLFlowLogger(...), TensorBoardLogger(...)],
    )

    trainer.fit(model, datamodule)
    trainer.save_checkpoint("final_model.ckpt")

Training Configuration

| Parameter | Value | Why |
|---|---|---|
| Dataset | LLaVA-Instruct-150K | 157,712 image-text pairs from COCO |
| Epochs | 3 | More would overfit on this dataset size |
| Batch Size | 20 (effective 40) | Largest that fits in A100 40GB with this model |
| Precision | bf16-mixed | Halves memory, ~50% faster, negligible quality loss |
| Gradient Clipping | 1.0 | Prevents exploding gradients during multimodal fusion |
| Warmup | 3% of total steps | Stabilizes early training when projector is random |
| Projector LR | 1e-3 | Learning from scratch needs higher LR |
| LoRA LR | 2e-4 | Fine-tuning needs gentler LR |
| Weight Decay | 0.01 | Standard regularization |
| LoRA Rank | 64 | Good capacity without excessive parameters |
| LoRA Alpha | 128 | Effective scaling = 2.0 |

06 Training Results — What 9 Hours on an A100 Gets You

Training Metrics

| Epoch | Train Loss | Val Loss | Notes |
|---|---|---|---|
| 1 | 1.892 | 1.598 | Projector learning fast, model starts grounding |
| 2 | 1.456 | 1.462 | Loss converging, responses getting coherent |
| 3 | 1.333 | 1.430 | Best val loss, slight overfitting beginning |

What the Model Learned

After 9 hours of training, the model can:

  • Identify animals: “Two cats lying on couch with one on left side, other on right side...”
  • Understand rooms: “Modern spacious kitchen with yellow walls, wood floors, dining table, refrigerator...”
  • Recognize food: “Close-up donut in plastic bag on table between two bananas...”
  • Describe activities: “Lively skate park scene with multiple skateboarders practicing tricks...”

Benchmark Results

| Benchmark | Score | Notes |
|---|---|---|
| Basic VQA | 53.8% (7/13) | Animal and room identification strong |
| POPE Hallucination | 80% accuracy | 20% hallucination rate (yes-bias typical for small models) |

A 53.8% VQA accuracy from a 270M model trained for 9 hours is actually impressive. For context, the original LLaVA used a 7B model (26× larger) trained for much longer. Our tiny model demonstrates that the architecture works — scaling up the base model and training data would dramatically improve results.

GPU Cost Analysis

| GPU | VRAM | Batch Size | Training Time | Cost/hr | Total Cost |
|---|---|---|---|---|---|
| A10 | 24GB | 8-12 | ~15-18 hrs | $0.75 | ~$12.75 |
| A100 (40GB) | 40GB | 16-20 | ~9 hrs | $1.29 | ~$11.61 |
| A100 (80GB) | 80GB | 24-32 | ~6-7 hrs | $1.99 | ~$13.30 |
| H100 | 80GB | 32-48 | ~4-5 hrs | $2.49 | ~$11.20 |

07 Evaluation — Measuring What the Model Sees

Basic VQA Evaluation

Python evaluate.py
def generate_answer(model, image, question, device):
    image_inputs = model.vision_processor(images=image, return_tensors="pt")
    pixel_values = image_inputs["pixel_values"].to(device)

    prompt = f"<image>\nHuman: {question}\nAssistant:"
    text_inputs = model.tokenizer(prompt, return_tensors="pt", padding=True,
                                   truncation=True, max_length=512)
    input_ids = text_inputs["input_ids"].to(device)
    attention_mask = text_inputs["attention_mask"].to(device)

    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            images=pixel_values,
            max_new_tokens=30,
            do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
        )

    response = model.tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "Assistant:" in response:
        response = response.split("Assistant:", 1)[1]
    return response.lower().strip()

POPE Hallucination Test

The POPE (Polling-based Object Probing Evaluation) test checks if the model hallucinates objects that aren’t in the image:

Python evaluate.py
def run_pope_mini(checkpoint_path, num_samples=100):
    # model and device are restored from checkpoint_path (loading omitted here)
    tp = fp = tn = fn = 0  # confusion-matrix counters

    POPE_TESTS = [
        {
            "path": "samples/test_images/sample_001.jpg",
            "present": ["cat", "couch"],            # should answer "yes"
            "absent": ["dog", "elephant", "car"],   # should answer "no"
        },
    ]

    for test in POPE_TESTS:
        image = Image.open(test["path"]).convert("RGB")

        for obj in test["present"]:
            question = f"Is there a {obj} in this image? Answer yes or no."
            answer = generate_answer(model, image, question, device)
            if "yes" in answer: tp += 1
            else: fn += 1

        for obj in test["absent"]:
            question = f"Is there a {obj} in this image? Answer yes or no."
            answer = generate_answer(model, image, question, device)
            if "no" in answer: tn += 1
            else: fp += 1  # hallucination!

    accuracy = (tp + tn) / (tp + fp + tn + fn)
    hallucination_rate = fp / (fp + tn)
    return accuracy, hallucination_rate
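
With the reported numbers (80% accuracy, 20% hallucination), the metric arithmetic works out like this; the confusion-matrix counts below are hypothetical values chosen to match those figures, not the actual evaluation counts:

```python
# Hypothetical counts matching the reported 80% / 20% figures
tp, fn = 4, 1   # objects actually present: 4 correct "yes", 1 miss
tn, fp = 4, 1   # objects actually absent: 4 correct "no", 1 hallucination

accuracy = (tp + tn) / (tp + fp + tn + fn)   # 8 / 10 = 0.8
hallucination_rate = fp / (fp + tn)          # 1 / 5  = 0.2
print(accuracy, hallucination_rate)
```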

08 MLOps — The Full CI/CD Pipeline

GitHub Actions Workflow

Every push to main triggers an automated pipeline:

Push to main
    │
    ▾
Tests (pytest + ruff linting)
    │
    ▾
Download checkpoint from HuggingFace Hub
    │
    ▾
Export model for deployment
    │
    ▾
Deploy to HuggingFace Spaces (Gradio app)
YAML train_deploy.yml
name: MLOps Pipeline - Test, Trace & Deploy

on:
  push:
    branches: [main]
    paths: ['src/**', 'configs/**', 'hf_space/**']
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install UV
        run: curl -LsSf https://astral.sh/uv/install.sh | sh
      - name: Lint with ruff
        run: ruff check src/ --ignore E501,F401
      - name: Run tests
        run: pytest tests/ -v --tb=short

  trace-and-deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Login to HuggingFace
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python -c "import os; from huggingface_hub import login; login(token=os.environ['HF_TOKEN'])"
      - name: Download checkpoint
        run: python scripts/download_checkpoint.py
      - name: Export model
        run: python src/trace_model.py --ckpt_path final_model.ckpt --output_path hf_space/model.pt
      - name: Deploy to HuggingFace Spaces
        run: python scripts/deploy_to_hf.py

Tools Used

| Tool | Purpose | Why |
|---|---|---|
| PyTorch Lightning | Training framework | Clean code, automatic optimization, multi-GPU |
| Hydra | Configuration | YAML configs, command-line overrides |
| MLflow | Experiment tracking | Loss curves, hyperparameters, model versioning |
| DVC | Data versioning | Track 157K training samples across experiments |
| GitHub Actions | CI/CD | Automated testing and deployment on every push |
| HuggingFace Spaces | Deployment | Free GPU inference, Gradio UI, public demo |
| Docker | Containerization | Reproducible environments |
| Weights & Biases | Logging (optional) | Real-time training visualization |

09 Deployment — From Checkpoint to Live Demo

The Gradio App

Python app.py
def predict_with_image(image, question, max_tokens=100, temperature=0.7):
    if not isinstance(image, Image.Image):
        image = Image.fromarray(image).convert('RGB')

    vision_inputs = vision_processor(images=[image], return_tensors="pt")
    pixel_values = vision_inputs["pixel_values"].to(device)

    prompt = f"<image>\nHuman: {question}\nAssistant:"
    text_inputs = tokenizer(prompt, return_tensors="pt", padding=True,
                            truncation=True, max_length=256)
    input_ids = text_inputs["input_ids"].to(device)
    attention_mask = text_inputs["attention_mask"].to(device)

    with torch.no_grad():
        outputs = model.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            images=pixel_values,
            max_new_tokens=min(max_tokens, 150),
            temperature=min(max(temperature, 0.1), 2.0),
            do_sample=temperature > 0.1,
            repetition_penalty=1.1
        )

    generated_tokens = outputs[0][input_ids.shape[1]:]
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response.strip()

Live Demo

The model is deployed at huggingface.co/spaces/sagar007/Multimodal-Gemma. Upload any image, ask a question, and see the model respond in real-time.

10 Project Structure — How It All Fits Together

Bash project layout
multimodal-gemma-270m/
├── .github/workflows/train_deploy.yml    # CI/CD pipeline
├── configs/
│   ├── config.yaml                       # Main Hydra config
│   ├── model_config.yaml                 # Architecture settings
│   ├── training_config.yaml              # Hyperparameters
│   └── data_config.yaml                  # Dataset config
├── src/
│   ├── models/
│   │   ├── lightning_module.py           # PyTorch Lightning trainer
│   │   ├── multimodal_gemma.py           # Core model architecture
│   │   └── projectors.py                 # Vision projector MLP
│   ├── data/datamodule.py                # Lightning DataModule
│   └── utils/config.py                   # Configuration utilities
├── hf_space/
│   ├── app.py                            # Gradio application
│   └── requirements.txt                  # Space dependencies
├── samples/test_images/                  # Evaluation images
├── train.py                              # Training entry point
├── inference.py                          # Inference script
├── evaluate.py                           # Benchmark evaluation
├── gradio_app.py                         # Local web interface
├── dvc.yaml                              # DVC pipeline
├── Dockerfile                            # Container image
├── Makefile                              # Convenience commands
└── pyproject.toml                        # Project metadata
The clean separation between src/models/, src/data/, and configs/ is intentional. You should be able to swap out the language model (e.g., replace Gemma with Phi or Qwen), change the dataset (e.g., switch to ShareGPT), or modify training hyperparameters — all without touching the core model code.

Summary: What I Learned Building This

  1. Multimodal fusion is simpler than you think — A 2-layer MLP projector is all you need to bridge vision and language. The pretrained models do the heavy lifting.
  2. LoRA makes everything practical — Training only 3.4% of parameters (18.6M out of 539M) means you can fine-tune on a single GPU in hours, not days.
  3. The merge function is everything — How you interleave image and text tokens determines what the model can learn. Getting the attention masks and label masking right is critical.
  4. Small models teach big lessons — A 270M model won’t win benchmarks, but it will teach you every architectural decision that matters. Scale is a knob you turn later.
  5. MLOps completes the loop — A model that only runs in your notebook isn’t useful. CI/CD, experiment tracking, and one-click deployment turn a research project into a product.
  6. Open source everything — The full code, model weights, and training logs are available. If one person learns from this, it was worth the 9 hours of A100 time.
Building a multimodal model from scratch isn't about competing with GPT-4V — it's about
understanding every decision from pixels to predictions. The code is at
github.com/sagar431/multimodal-gemma-270m. Clone it, train it, break it, fix it.