NEO: How to Build Vision-Language Models That Actually Understand Images

Posted on Oct 25, 2025

In October 2025, researchers from S-Lab (NTU), Xi’an Jiaotong University, and SenseTime published NEO: a vision-language model that matches GPT-4V performance while using only 390 million training examples instead of billions.

More importantly, they proved native VLMs (vision and language from scratch) can compete with modular systems (bolting vision onto existing LLMs).

This isn’t just an academic achievement. It’s a blueprint for building efficient multimodal AI without Google-scale infrastructure.

The Problem: Why Most Vision-Language Models Are Frankenstein Creations

Here’s how 90% of vision-language models work today:

Step 1: Take a powerful image model (like CLIP or a vision transformer)
Step 2: Take a powerful language model (like GPT or LLaMA)
Step 3: Bolt them together with an “adapter” layer
Step 4: Hope they learn to talk to each other

Examples: GPT-4V, LLaVA, InternVL, Qwen-VL

Why This Works (Sort Of)

Pros:

  • Leverage pre-trained components (billions of dollars of training)
  • Fast to build (weeks instead of years)
  • Can swap components easily

Cons:

  • Vision and language speak different “languages” internally
  • Adapter layer is a bottleneck
  • Not truly unified understanding
  • Requires massive alignment datasets

The Alternative: Native Vision-Language Models

What if you trained one model that understands pixels and words from the beginning?

This is NEO’s approach - and it’s working.


What Makes NEO Different: Native Vision-Language Primitives

NEO doesn’t bolt components together. It builds vision-language understanding from the ground up using three core innovations.

Innovation 1: Native-RoPE (Rotary Position Embeddings)

The Problem:

  • Text is 1D (sequence: “The cat sat on the mat”)
  • Images are 2D (grid: 512×512 pixels)
  • Standard language models can’t handle 2D position information

NEO’s Solution: Native-RoPE

Allocates position information across three dimensions:

  • T (Temporal): Sequence position for text
  • H (Height): Row position for images
  • W (Width): Column position for images

How it works:

Text token "cat":
  T = 3 (third word)
  H = 0 (not an image)
  W = 0 (not an image)

Image patch at row 15, col 23:
  T = constant (same "time" for all patches)
  H = 15 (row 15)
  W = 23 (column 23)

Result: Model natively understands 2D spatial relationships in images while still processing text sequences properly.
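To make the scheme concrete, here is a minimal sketch of how the three position axes could be turned into rotary angles, assuming the head dimension is simply split across T, H, and W. The split ratio and function names are illustrative, not the paper’s exact implementation.

import torch

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE angles for a 1D stream of positions (illustrative)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]   # (seq, dim/2)

def native_rope_angles(t_pos, h_pos, w_pos, head_dim):
    """Hypothetical Native-RoPE: split the head dimension across T/H/W axes.
    Text tokens carry T (H = W = 0); image patches carry H/W (T held constant)."""
    d_t, d_h, d_w = head_dim // 2, head_dim // 4, head_dim // 4  # assumed split
    return torch.cat([
        rope_angles(t_pos, d_t),   # temporal axis (text order)
        rope_angles(h_pos, d_h),   # image row
        rope_angles(w_pos, d_w),   # image column
    ], dim=-1)                     # (seq, head_dim/2) rotation angles

# Example: 3 text tokens followed by a 2×2 grid of image patches
t = torch.tensor([0, 1, 2, 3, 3, 3, 3])   # patches share one "time" step
h = torch.tensor([0, 0, 0, 0, 0, 1, 1])
w = torch.tensor([0, 0, 0, 0, 1, 0, 1])
angles = native_rope_angles(t, h, w, head_dim=64)
print(angles.shape)  # torch.Size([7, 32])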

Innovation 2: Mixed Attention Strategy

Different content types need different attention patterns:

For text:

  • Causal attention (can’t see future words)
  • Enables autoregressive generation
  • “The cat [can’t see sat, on, mat yet]”

For images:

  • Bidirectional attention (can see entire image)
  • Enables spatial understanding
  • “Patch at (15,23) can see all other patches”

NEO implements both simultaneously - text gets causal, images get bidirectional.

Impact: +0.7% accuracy over pure causal attention across all benchmarks.
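A minimal sketch of how such a mixed mask can be built, assuming each position carries an is_image flag; this is a conceptual illustration, not NEO’s released code.

import torch

def mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Build a (seq, seq) boolean mask where True means "may attend".
    Text tokens get causal attention; image patches attend to each other freely.
    `is_image` is a boolean vector marking image-patch positions (assumption)."""
    seq = is_image.shape[0]
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))   # text: no look-ahead
    image_block = is_image[:, None] & is_image[None, :]           # patch-to-patch pairs
    return causal | image_block                                   # bidirectional inside the image

# Example: 3 text tokens, 4 image patches, then 1 more text token
is_image = torch.tensor([False, False, False, True, True, True, True, False])
mask = mixed_attention_mask(is_image)
print(mask.int())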

Innovation 3: Pre-Buffer Architecture

The challenge: How to learn visual features without forgetting language?

NEO’s approach:

Stage 1 (Pre-training): Split the model temporarily

  • Pre-Buffer: 12 layers learn visual encoding
  • LLM layers: Frozen (preserve language knowledge)
  • Train on 345M image-text pairs

Stage 2 (Mid-training): Merge everything

  • Pre-Buffer + LLM = unified model
  • Train end-to-end on 40M high-quality examples
  • Full vision-language integration

Stage 3 (Fine-tuning): Optimize for tasks

  • 4M instruction examples
  • Visual QA, dialogue, mathematics, reasoning

Why this works:

  • Preserves linguistic ability (frozen LLM during pre-training)
  • Learns visual features efficiently (dedicated pre-Buffer)
  • Unifies understanding (merged end-to-end training)
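A toy sketch of the staged setup, assuming the model exposes its Pre-Buffer and LLM layers as separate module lists; every name and size here is a placeholder, not the released architecture.

import torch.nn as nn

class NativeVLM(nn.Module):
    """Toy stand-in for the staged training setup (names are illustrative)."""
    def __init__(self, d_model=512, n_heads=8, n_buffer=12, n_llm=24):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.pre_buffer = nn.ModuleList(layer() for _ in range(n_buffer))  # learns visual encoding
        self.llm_layers = nn.ModuleList(layer() for _ in range(n_llm))     # pretrained language stack

def set_stage(model: NativeVLM, stage: int):
    """Stage 1: freeze the LLM, train only the Pre-Buffer.
    Stages 2-3: unfreeze everything for end-to-end training."""
    for p in model.llm_layers.parameters():
        p.requires_grad = (stage >= 2)
    for p in model.pre_buffer.parameters():
        p.requires_grad = True

model = NativeVLM()
set_stage(model, stage=1)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Stage 1 trainable params: {trainable:,}")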

Training NEO: How to Build a Vision-Language Model

Stage 1: Pre-Training (345M Examples)

Goal: Learn to connect pixels to words

Dataset Composition:

| Source | Count | Purpose |
|---|---|---|
| LAION-400M | 120M | Web-scale diversity |
| COYO-700M | 150M | High-quality captions |
| BLIP3o | 20M | Synthetic captions |
| OpenImages | 5M | Recaptioned with detail |
| LAION-COCO | 30M | Natural scenes |
| Wukong | 20M | Chinese + OCR text |

Training setup:

  • Freeze LLM weights (preserve language)
  • Train Pre-Buffer (12 layers) + adapter
  • 3:7 language-to-multimodal data ratio
  • Learning rate: 8×10⁻⁴
  • Hardware: 128 GPUs (80GB each)

What the model learns:

  • Basic object recognition
  • Spatial relationships
  • Text in images (OCR)
  • Image-caption alignment

Stage 2: Mid-Training (40M Examples)

Goal: Strengthen vision-language reasoning

Dataset Composition:

| Task Type | Percentage | Examples |
|---|---|---|
| Image Captioning | 66% | Describe this image |
| Conversation | 11% | Multi-turn dialogue |
| Detection | 8% | Find objects/regions |
| OCR | 15% | Read text in images |

Training changes:

  • Unfreeze everything (end-to-end optimization)
  • Merge Pre-Buffer into unified architecture
  • Learning rate: 4×10⁻⁵ (lower for stability)
  • Resolution: 256×256 to 2,048×2,048

What the model learns:

  • Complex reasoning
  • Multi-turn conversations
  • Precise localization
  • Document understanding

Stage 3: Supervised Fine-Tuning (4M Examples)

Goal: Master specific tasks

Dataset Composition:

  • Visual question answering
  • Multimodal dialogue
  • Mathematical reasoning
  • Knowledge-intensive QA
  • Bilingual (English + Chinese)

Training setup:

  • Learning rate: 5×10⁻⁵
  • Full high-resolution support
  • Task-specific optimization

What the model learns:

  • Following instructions precisely
  • Answering complex questions
  • Mathematical problem-solving
  • Cross-lingual understanding

Performance: NEO vs. The World

NEO-2.2B (Lightweight Model)

| Benchmark | NEO-2.2B | InternVL2.5 (2B) | Difference |
|---|---|---|---|
| MMMU (reasoning) | 48.6 | 48.6 | Tied |
| MMBench (general) | 76.0 | 74.7 | +1.3 ✅ |
| MMVet (real-world) | 49.6 | 50.3 | -0.7 |
| ChartQA (charts) | 81.2 | 79.1 | +2.1 ✅ |
| TextVQA (OCR) | 74.0 | 71.5 | +2.5 ✅ |

Key insight: Native 2B model matches or beats modular 2B models despite simpler architecture.

NEO-9B (Larger Model)

| Benchmark | NEO-9B | InternVL3 (8B) | Difference |
|---|---|---|---|
| MMMU | 54.6 | 62.7 | -8.1 |
| MMBench | 82.1 | 83.4 | -1.3 |
| SEED-Bench | 76.3 | 76.2 | +0.1 ✅ |
| ChartQA | 82.1 | 83.9 | -1.8 |
| DocVQA | 88.6 | 90.6 | -2.0 |

Key insight: Competitive with larger modular models, room for improvement on knowledge-intensive tasks.

The Training Data Difference

NEO: 390M total training examples
Typical modular VLM: 1-5B+ training examples

NEO achieves 90-95% of modular VLM performance with 10-20x less training data.

Why? Native architecture is more efficient - no alignment overhead between separate vision/language components.


Architecture Deep-Dive: How NEO Actually Works

Input Processing

For text input: “Describe this image”

# Tokenize text
tokens = tokenizer("Describe this image")
# Output: [Describe, this, image]

# Add position embeddings
for i, token in enumerate(tokens):
    token.T = i  # Sequence position
    token.H = 0  # Not an image
    token.W = 0  # Not an image

For image input: 512×512 photo

# Divide the 512×512 image into 16×16-pixel patches
patches = image.to_patches(patch_size=16)
# Output: a 32×32 grid = 1,024 patches

# Add position embeddings
for row in range(32):
    for col in range(32):
        patch = patches[row][col]
        patch.T = 0  # Constant for all patches
        patch.H = row  # Row position
        patch.W = col  # Column position

Unified Token Sequence

After processing, NEO sees:

[Describe, this, image, <patch_0_0>, <patch_0_1>, ..., <patch_31_31>]
 ^text^   ^text^ ^text^  ^image patches (1,024 total)^

All tokens flow through the same transformer layers.
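Conceptually, the two streams are simply packed into one flat sequence before those shared layers. A minimal sketch with illustrative tensor names, assuming the text and patch embeddings from the steps above:

import torch

# Assume `text_emb` (n_text, d) and `patch_emb` (n_patches, d) plus their (T, H, W)
# ids were produced as shown above; the names and sizes here are illustrative.
n_text, n_patches, d = 3, 1024, 1024
text_emb = torch.randn(n_text, d)
patch_emb = torch.randn(n_patches, d)

# One flat sequence: text tokens first, then all image patches
sequence = torch.cat([text_emb, patch_emb], dim=0)          # (3 + 1024, d)
is_image = torch.cat([torch.zeros(n_text, dtype=torch.bool),
                      torch.ones(n_patches, dtype=torch.bool)])
print(sequence.shape, is_image.sum().item())                 # torch.Size([1027, 1024]) 1024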

Attention Mechanism

# Mixed attention based on token type
for token in sequence:
    if token.is_text:
        # Causal attention (can't see future)
        attention_mask = causal_mask(token.position)
    elif token.is_image:
        # Bidirectional attention (can see all patches)
        attention_mask = full_mask()

    # Apply attention with appropriate mask
    output = attention(token, mask=attention_mask)

Generation Process

User: “What’s in this image?”

NEO:

  1. Processes image patches (bidirectional attention)
  2. Processes text question (causal attention)
  3. Generates answer token-by-token (causal attention)
  4. Each new token attends to all previous tokens + all image patches

Output: “The image shows a golden retriever playing in a park with a red ball.”
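A rough sketch of that decode loop, written against a generic causal-LM interface rather than NEO’s actual API; `image_tokens` is a placeholder for the already-encoded image prefix (a simplification, since NEO feeds patch embeddings directly).

import torch

def answer(model, tokenizer, image_tokens, question, max_new_tokens=64):
    """Illustrative decode loop for the four steps above (not the repo's API):
    the prompt (image prefix + question) is encoded once, then answer tokens are
    generated one at a time, each attending to everything that came before it."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    sequence = torch.cat([image_tokens, prompt_ids], dim=1)    # patches first, then text
    generated = []
    for _ in range(max_new_tokens):
        logits = model(sequence).logits                         # (1, seq, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)    # greedy next token
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id)
        sequence = torch.cat([sequence, next_id], dim=1)        # grow the context
    return tokenizer.decode(torch.cat(generated, dim=1)[0]) if generated else ""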


How to Use NEO: Practical Guide

Installation

# Clone repository
git clone https://github.com/EvolvingLMMs-Lab/NEO.git
cd NEO

# Install dependencies
pip install -r requirements.txt

# Download model weights (Hugging Face)
# NEO-2B: https://huggingface.co/EvolvingLMMs-Lab/NEO-2B-SFT
# NEO-9B: https://huggingface.co/EvolvingLMMs-Lab/NEO-9B-SFT

Basic Usage

from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Load model
model = AutoModel.from_pretrained("EvolvingLMMs-Lab/NEO-2B-SFT")
tokenizer = AutoTokenizer.from_pretrained("EvolvingLMMs-Lab/NEO-2B-SFT")

# Load image
image = Image.open("example.jpg")

# Ask question
question = "What's in this image?"

# Generate response
response = model.generate(
    image=image,
    text=question,
    max_length=200
)

print(response)
# Output: "The image shows a golden retriever playing..."

Advanced Usage: Multi-Turn Conversation

# Start conversation
conversation = []

# Turn 1
conversation.append({
    "role": "user",
    "content": "What color is the dog?",
    "image": image
})

response1 = model.generate(conversation)
# Output: "The dog is golden/tan colored."

conversation.append({
    "role": "assistant",
    "content": response1
})

# Turn 2 (continues conversation)
conversation.append({
    "role": "user",
    "content": "What is it holding?"
})

response2 = model.generate(conversation)
# Output: "The dog is holding a red ball in its mouth."

Custom Fine-Tuning

from neo.training import train

# Prepare your dataset
dataset = [
    {
        "image": "path/to/image1.jpg",
        "conversations": [
            {"role": "user", "content": "Question about image?"},
            {"role": "assistant", "content": "Answer based on image."}
        ]
    },
    # ... more examples
]

# Fine-tune on your domain
train(
    model="EvolvingLMMs-Lab/NEO-2B-SFT",
    dataset=dataset,
    epochs=3,
    learning_rate=5e-5,
    output_dir="./finetuned_model"
)

Use Cases: What You Can Build with NEO

1. Visual Question Answering System

Example: Customer support for e-commerce

# Customer uploads product image
image = customer_upload("broken_product.jpg")

# Ask: "What's wrong with this product?"
diagnosis = neo.generate(
    image=image,
    text="Analyze this product image and identify any defects or damage."
)

# Output: "The product shows a crack on the left side of the casing..."

Value: Automated visual triage before human support

2. Document Understanding

Example: Invoice processing

# Process scanned invoice
invoice_image = Image.open("invoice_2025.jpg")

# Extract information
data = neo.generate(
    image=invoice_image,
    text="Extract the invoice number, date, total amount, and line items from this invoice. Return as JSON."
)

# Output: {"invoice_number": "INV-2025-001", "date": "2025-10-25", ...}

Value: 10-100x faster than manual data entry

3. Medical Image Analysis

Example: Radiology assistance

# X-ray image
xray = Image.open("chest_xray.jpg")

# Analysis
findings = neo.generate(
    image=xray,
    text="Describe any abnormalities visible in this chest X-ray. Note: This is for educational purposes only."
)

# Output: "The image shows signs of possible infiltration in the lower right lobe..."

Value: Pre-screening to prioritize urgent cases

4. Content Moderation

Example: Social media safety

# User-uploaded image
user_image = Image.open("user_post.jpg")

# Check for violations
moderation = neo.generate(
    image=user_image,
    text="Analyze this image for: violence, explicit content, hate symbols, or dangerous activities. Return risk level (low/medium/high) and explanation."
)

# Output: {"risk": "low", "explanation": "Image shows a landscape photo..."}

Value: Automated first-pass moderation

5. Educational Assistant

Example: Homework help

# Student uploads math problem photo
problem_image = Image.open("math_problem.jpg")

# Solve step-by-step
solution = neo.generate(
    image=problem_image,
    text="Solve this math problem step by step, explaining each step clearly."
)

# Output: "Step 1: Identify the equation... Step 2: Isolate the variable..."

Value: 24/7 tutoring assistance


Limitations & When NOT to Use NEO

1. Knowledge-Intensive Tasks

Problem: NEO underperforms on MMMU (reasoning benchmark) compared to GPT-4V

Example:

  • ❌ “Explain the historical significance of this architectural style”
  • ✅ “What architectural elements are visible in this building?”

Why: Training corpus emphasizes visual-linguistic alignment over encyclopedic knowledge

Mitigation: Combine with RAG (Retrieval-Augmented Generation) for knowledge queries
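One way to wire that up, assuming a neo.generate call like the examples above and any text retriever of your choice; both interfaces are placeholders, not part of the NEO release.

def answer_with_rag(neo, retriever, image, question):
    """Hypothetical two-step pipeline: NEO handles what is *visible*,
    a retriever supplies the encyclopedic knowledge its training under-covers."""
    # Step 1: ground the question in the image
    description = neo.generate(image=image,
                               text="Describe the visible elements relevant to: " + question)
    # Step 2: retrieve background knowledge about what was seen
    passages = retriever.search(description, top_k=3)
    # Step 3: answer using both the visual description and the retrieved text
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nImage description:\n{description}\n\nQuestion: {question}"
    return neo.generate(image=image, text=prompt)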

2. Very High-Resolution Images

Problem: Computational cost scales with image resolution

Current support: Up to 2,048×2,048 pixels
Practical limit: ~1,024×1,024 for real-time inference

Workaround: Tile large images or use image pyramids
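A simple tiling sketch with Pillow; the tile size, overlap, and per-tile prompting strategy are choices you would tune for your data, not something prescribed by the paper, and the file name is hypothetical.

from PIL import Image

def tile_image(image: Image.Image, tile_size: int = 1024, overlap: int = 128):
    """Split a very large image into overlapping tiles that fit the practical
    resolution limit; overlap reduces the chance of cutting objects in half."""
    width, height = image.size
    step = tile_size - overlap
    tiles = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            box = (left, top, min(left + tile_size, width), min(top + tile_size, height))
            tiles.append((box, image.crop(box)))
    return tiles

# Example: analyze each tile separately, then merge the answers downstream
large = Image.open("blueprint_8k.jpg")           # hypothetical oversized input
for box, tile in tile_image(large):
    print(box, tile.size)                         # feed each tile to the model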

3. Real-Time Video Analysis

Problem: Not optimized for video streams

Current: Single image processing
Missing: Temporal modeling across frames

Alternative: Process keyframes separately, use specialized video models

4. Highly Specialized Domains (Out-of-the-Box)

Problem: General training data doesn’t cover niche domains

Examples:

  • Satellite imagery analysis
  • Microscopy images
  • Industrial defect detection

Solution: Fine-tune on domain-specific data (hundreds to thousands of examples)


NEO vs. Competition: What’s Different?

NEO vs. LLaVA

| Aspect | NEO | LLaVA |
|---|---|---|
| Architecture | Native (unified) | Modular (CLIP + LLaMA) |
| Position Encoding | Native-RoPE (2D spatial) | 1D sequence only |
| Attention | Mixed (causal + bidirectional) | Pure causal |
| Training Data | 390M examples | 1.2M (LLaVA-1.5) |
| Performance | Competitive | Strong baseline |

Winner: NEO for efficiency, LLaVA for simplicity

NEO vs. GPT-4V

| Aspect | NEO | GPT-4V |
|---|---|---|
| Access | Open source | Closed API |
| Cost | Self-host ($0-200/month) | $0.01-0.10 per image |
| Customization | Full fine-tuning | Prompt engineering only |
| Performance | 85-90% of GPT-4V | Best-in-class |
| Latency | <2s (local GPU) | 3-10s (API call) |

Winner: GPT-4V for performance, NEO for control/cost

NEO vs. InternVL

| Aspect | NEO | InternVL |
|---|---|---|
| Architecture | Native | Modular |
| Model Sizes | 2B, 9B | 2B, 8B, 26B |
| Training Data | 390M | 1B+ |
| Performance | 90-95% of InternVL | State-of-the-art |
| Resource Requirements | Lower | Higher |

Winner: Tie - different trade-offs for different needs


The Research Implications: Why This Matters

1. Native VLMs Are Viable

Previous belief: Need separate vision encoders (CLIP, ViT) to match modular VLMs

NEO proves: Native architecture with proper primitives (Native-RoPE, mixed attention) can compete

Impact: Simplifies VLM development, reduces engineering complexity

2. Data Efficiency Through Architecture

Key finding: NEO achieves 90-95% performance with 10-20x less training data

Why: Native architecture doesn’t waste capacity on vision-language alignment

Implication: Smaller teams can build competitive VLMs without Google-scale compute

3. Reusable Components

NEO releases:

  • Pre-Buffer architecture (drop-in visual encoding)
  • Native-RoPE implementation (position encoding)
  • Mixed attention mechanisms

Value: Other researchers can build on these primitives

4. Future of Multimodal AI

Trend: Moving from modular (bolt components together) to native (integrated from scratch)

Examples:

  • Text + images (NEO)
  • Text + audio (Gemini)
  • Text + images + audio + video (future)

NEO provides roadmap for building truly unified multimodal models


Implementation Guide: Building Your Own NEO-Based System

Week 1: Setup and Testing

Day 1-2: Environment setup

# Clone repository
git clone https://github.com/EvolvingLMMs-Lab/NEO.git

# Install dependencies
pip install -r requirements.txt
pip install torch torchvision transformers

# Download model weights
huggingface-cli download EvolvingLMMs-Lab/NEO-2B-SFT

Day 3-4: Test basic functionality

# Run provided examples
python examples/image_qa.py
python examples/conversation.py
python examples/document_understanding.py

# Benchmark on your hardware
python benchmark.py --model NEO-2B-SFT --batch-size 1

Day 5: Evaluate on your domain

# Prepare test set (50-100 examples)
test_data = load_your_domain_images()

# Run evaluation
results = evaluate_model(
    model="NEO-2B-SFT",
    test_data=test_data,
    metrics=["accuracy", "latency", "memory"]
)

Decision point: Proceed if accuracy >70% on your test set

Week 2: Fine-Tuning (If Needed)

Prepare training data:

# Format: JSON with image paths and conversations
training_data = [
    {
        "image": "path/to/image1.jpg",
        "conversations": [
            {"role": "user", "content": "Question?"},
            {"role": "assistant", "content": "Answer."}
        ]
    },
    # 500-5,000 examples recommended
]

Fine-tune model:

python train.py \
    --model NEO-2B-SFT \
    --data training_data.json \
    --epochs 3 \
    --learning-rate 5e-5 \
    --batch-size 4 \
    --output finetuned_neo

Evaluate improvement:

# Compare base vs. fine-tuned
base_results = evaluate("NEO-2B-SFT", test_data)
finetuned_results = evaluate("finetuned_neo", test_data)

print(f"Improvement: {finetuned_results.accuracy - base_results.accuracy:.1%}")

Week 3: Production Deployment

Option 1: Local deployment

from flask import Flask, request
import torch

app = Flask(__name__)
model = load_neo_model("finetuned_neo")

@app.route('/analyze', methods=['POST'])
def analyze_image():
    image = request.files['image']
    question = request.form['question']

    response = model.generate(
        image=image,
        text=question,
        max_length=200
    )

    return {"answer": response}

app.run(host='0.0.0.0', port=8000)

Option 2: Cloud deployment (AWS/GCP)

# docker-compose.yml
version: '3'
services:
  neo-api:
    image: neo-api:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Option 3: Serverless (Modal, RunPod)

import modal

stub = modal.Stub("neo-inference")

@stub.function(
    gpu="A10G",
    image=modal.Image.debian_slim().pip_install("transformers", "torch")
)
def generate(image_url: str, question: str):
    model = load_neo_model()
    return model.generate(image_url, question)

Week 4: Monitoring and Optimization

Set up monitoring:

import prometheus_client as prom

# Track metrics
inference_time = prom.Histogram('neo_inference_seconds', 'Time to generate response')
accuracy_gauge = prom.Gauge('neo_accuracy', 'Model accuracy on validation set')
error_counter = prom.Counter('neo_errors_total', 'Total errors')

# Monitor in production
@inference_time.time()
def monitored_generate(image, text):
    try:
        response = model.generate(image, text)
        return response
    except Exception as e:
        error_counter.inc()
        raise

Optimize performance:

# Enable mixed precision
model = model.half()  # FP16 inference

# Batch processing
images = [image1, image2, image3]
questions = [q1, q2, q3]
responses = model.generate_batch(images, questions)

# Quantization (if needed)
from torch.quantization import quantize_dynamic
model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

Cost Analysis: Running NEO in Production

Infrastructure Costs

NEO-2.2B (Lightweight):

  • GPU: RTX 4090 (24GB) or A10G (24GB)
  • Cloud cost: $0.50-1.50/hour (AWS/GCP spot instances)
  • Self-hosted: ~$1,500-2,000 (consumer GPU) or $10K-15K (data center GPU)

NEO-9B (Full-size):

  • GPU: A100 (40GB) or H100 (80GB)
  • Cloud cost: $2-5/hour (AWS/GCP spot instances)
  • Self-hosted: ~$30K-50K (data center GPU)

Operational Costs (Monthly)

Small scale (1,000 requests/day):

  • Cloud GPU rental: $360-1,080/month
  • Bandwidth: $10-50/month
  • Storage: $5-10/month
  • Total: $375-1,140/month

Medium scale (10,000 requests/day):

  • Cloud GPU rental: $720-2,160/month (2-4 instances)
  • Bandwidth: $50-200/month
  • Storage: $20-50/month
  • Total: $790-2,410/month

Large scale (100,000 requests/day):

  • Cloud GPU rental: $3,600-10,800/month (10-20 instances + load balancer)
  • Bandwidth: $200-1,000/month
  • Storage: $100-500/month
  • Total: $3,900-12,300/month

Cost Comparison: NEO vs. API Services

Scenario: 10,000 image analysis requests/month

| Service | Cost per Request | Monthly Cost | Notes |
|---|---|---|---|
| GPT-4V API | $0.01-0.10 | $100-1,000 | Variable pricing |
| Google Gemini | $0.002-0.02 | $20-200 | Cheaper but less capable |
| NEO Self-Hosted | $0.01-0.05 | $100-500 | Fixed GPU rental |
| NEO Cloud (spot) | $0.02-0.08 | $200-800 | Scalable |

Breakeven analysis:

  • <5K requests/month: Use APIs (simpler)
  • 5K-50K requests/month: Consider self-hosting
  • >50K requests/month: Self-host wins (10-50x cheaper)
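A back-of-the-envelope version of that breakeven math. The per-request API price, GPU rate, throughput, and reserved-hours floor are all assumptions drawn loosely from the table above; swap in your own quotes.

def monthly_cost_api(requests, price_per_request=0.05):
    """API route: pay per image (mid-range GPT-4V pricing from the table above)."""
    return requests * price_per_request

def monthly_cost_self_hosted(requests, gpu_hourly=1.00,
                             requests_per_gpu_hour=500, min_reserved_hours=360):
    """Self-hosting: rent GPU-hours for the traffic, with a floor for keeping an
    instance warm. All four numbers are assumptions -- replace with your own."""
    needed_hours = requests / requests_per_gpu_hour
    return max(needed_hours, min_reserved_hours) * gpu_hourly

for volume in (1_000, 5_000, 50_000, 300_000):
    api, hosted = monthly_cost_api(volume), monthly_cost_self_hosted(volume)
    winner = "API" if api < hosted else "self-host"
    print(f"{volume:>7,} req/month: API ${api:>8,.0f} vs self-host ${hosted:>8,.0f} -> {winner}")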


The Bottom Line: When to Use NEO

✅ Use NEO When:

1. You need control

  • Fine-tune on proprietary data
  • Custom deployment requirements
  • Data privacy regulations (HIPAA, GDPR)

2. You need cost efficiency at scale

  • >10K requests/day
  • Long-term production deployment
  • Predictable costs matter

3. You need low latency

  • Real-time applications (<2s response)
  • Edge deployment (on-device)
  • Batch processing large volumes

4. You want cutting-edge research

  • Building next-gen multimodal systems
  • Experimenting with native VLM architectures
  • Contributing to open research

❌ Don’t Use NEO When:

1. You need the absolute best performance

  • GPT-4V still wins on complex reasoning
  • InternVL3 better on knowledge-intensive tasks

2. You have low volume

  • <1K requests/month: API services cheaper and simpler
  • No need to manage infrastructure

3. You need video understanding

  • NEO optimized for images, not video streams
  • Use specialized video models

4. You lack technical resources

  • No ML engineering team
  • No GPU infrastructure
  • Prefer managed services

Learning Resources & Next Steps

Official Resources

  • Paper: From Pixels to Words (arXiv:2510.14979)
  • Code: https://github.com/EvolvingLMMs-Lab/NEO
  • Weights: https://huggingface.co/EvolvingLMMs-Lab/NEO-2B-SFT and https://huggingface.co/EvolvingLMMs-Lab/NEO-9B-SFT

Hands-On Tutorials

  1. Getting Started: Run NEO in 10 minutes
  2. Fine-Tuning Guide: Adapt to your domain
  3. Production Deployment: Scale to thousands of requests/day

Community

  • GitHub Issues: Report bugs, request features
  • Hugging Face Forums: Ask questions, share results
  • Research Discord: Discuss with researchers

Key Takeaways

  1. Native VLMs are viable - NEO proves unified architecture can compete with modular systems
  2. Data efficiency matters - 390M examples rivals models trained on billions
  3. Open source wins - Full control, customization, and no API costs
  4. Architecture innovation - Native-RoPE + mixed attention = core primitives for future VLMs
  5. Production ready - 2B and 9B models deployable today

Bottom line: NEO democratizes vision-language AI. You don’t need Google-scale resources to build powerful multimodal systems anymore.

What you can do this weekend:

  1. Clone the repo: git clone https://github.com/EvolvingLMMs-Lab/NEO.git
  2. Download NEO-2B: Test on your images
  3. Evaluate on your domain: See if it works for your use case
  4. Fine-tune (if needed): 500-5K examples for domain adaptation

The future of multimodal AI is native, efficient, and open. NEO shows the way.


Questions? Feedback?

Tried NEO and have results to share?


Published: October 25, 2025
Author: Jeremy Longshore
Reading time: 25 minutes
Paper: From Pixels to Words (arXiv:2510.14979)
Code: Apache 2.0 License