arXiv 2026

IKEA-Bench

Benchmarking and Mechanistic Analysis of Vision-Language Models
for Cross-Depiction Assembly Instruction Alignment

Zhuchenyang Liu · Yao Zhang · Yu Xiao
Aalto University

1,623
Benchmark Questions
19
VLMs Evaluated
6
Task Types
3
Alignment Strategies

Bridging Diagrams and Reality

Can VLMs understand wordless assembly diagrams well enough to match them with real-world video? IKEA-Bench systematically evaluates this cross-depiction alignment ability across 29 furniture products.

IKEA-Bench teaser: assembly diagrams vs. video frames

Assembly instruction diagrams (left) use schematic line drawings to depict steps, while real-world videos (right) show the same actions in photorealistic form. Bridging this depiction gap is the core challenge.

Six Task Types, Two Dimensions

Tasks span cross-modal alignment (matching diagrams to video) and procedural reasoning (understanding assembly sequences).

Code  Task                       Description                                            Type      N
T1    Step Recognition           Which diagram matches the action in the video?         4-way MC  320
T2    Action Verification        Does this video match this diagram?                    Binary    350
T3    Progress Tracking          Which step in the full sequence is happening now?      4-way MC  334
T4    Next-Step Prediction       What diagram comes after the current video action?     4-way MC  204
D1    Video Discrimination       Do two video clips show the same assembly step?        Binary    350
D2    Instruction Comprehension  What is the correct order of three shuffled diagrams?  4-way MC  65
Task examples for all 6 types

Leaderboard

Accuracy (%) under the Visual baseline setting. Even the best model reaches only 65.9% average — well below human-level performance on these tasks.

#    Model             Params  T1    T2    T3    T4    D1    D2    Avg
🥇   Gemini-3-Flash    -       65.3  68.6  65.6  43.1  71.1  81.5  65.9
🥈   Gemini-3.1-Pro    -       62.8  65.1  65.0  41.7  67.4  76.9  63.2
🥉   Qwen3.5-27B       27B     59.4  62.9  59.3  41.2  63.7  70.8  59.6
4    InternVL3.5-38B   38B     54.4  61.4  47.3  37.7  61.4  67.7  55.0
5    Qwen3.5-9B        9B      57.8  63.7  46.7  38.2  63.1  58.5  54.7
6    Qwen3-VL-8B       8B      53.1  56.6  49.4  39.7  58.3  58.5  52.6
7    Qwen3-VL-30B-A3B  30B     48.8  58.3  50.6  34.3  60.0  56.9  51.5
8    GLM-4.1V-9B       9B      48.4  55.7  43.7  35.8  50.3  47.7  46.9
9    MiniCPM-V-4.5     8B      49.7  55.7  41.0  32.8  50.0  50.8  46.7
10   Qwen2.5-VL-7B     7B      49.1  50.9  35.0  36.8  46.0  53.8  45.3
11   Gemma3-27B        27B     43.1  55.7  37.1  31.4  53.7  41.5  43.8
12   InternVL3.5-8B    8B      39.4  53.7  36.5  31.4  49.4  50.8  43.5
13   Qwen2.5-VL-3B     3B      42.8  51.1  35.6  28.9  48.3  52.3  43.2
14   Qwen3.5-2B        2B      44.4  56.6  32.9  36.3  51.4  36.9  43.1
15   Qwen3-VL-2B       2B      42.2  50.0  29.6  34.8  50.0  26.2  38.8
16   Gemma3-12B        12B     35.3  49.7  35.9  28.4  49.1  32.3  38.5
17   Gemma3-4B         4B      39.4  50.3  27.8  29.4  47.7  20.0  35.8
18   LLaVA-OV-8B       8B      35.3  46.3  27.8  29.4  41.4  27.7  34.7
19   InternVL3.5-2B    2B      33.4  50.3  29.9  23.0  48.6  20.0  34.2
-    Random            -       25.0  50.0  25.0  25.0  50.0  25.0  33.3
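The Avg column can be reproduced from the per-task scores, assuming (not stated explicitly above) that it is the unweighted macro-average over the six tasks rather than an N-weighted micro-average:

```python
def macro_avg(scores):
    """Unweighted mean of per-task accuracies, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)

# Per-task accuracies (T1, T2, T3, T4, D1, D2) from the leaderboard
gemini_flash = [65.3, 68.6, 65.6, 43.1, 71.1, 81.5]
random_baseline = [25.0, 50.0, 25.0, 25.0, 50.0, 25.0]

print(macro_avg(gemini_flash))     # 65.9
print(macro_avg(random_baseline))  # 33.3
```

Note that the random baseline averages 33.3 rather than the intuitive 25/50 split because four tasks are 4-way multiple choice and two are binary.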

What We Discovered

Three key insights from evaluating 19 VLMs across 3 alignment strategies.

👁

Diagram Blindness

All 17 open-source models perform worse on D2 (instruction ordering) with Visual input than with Text Only — they struggle to read diagrams they were supposedly trained on. Text descriptions consistently rescue comprehension.

⚖️

Text Helps Comprehension, Hurts Alignment

Adding text descriptions boosts diagram understanding (D2: +24pp) but degrades cross-modal alignment (T1: −6pp). Text acts as a shortcut — models rely on text matching and attend less to visual content.

🧠

Architecture > Scale

Qwen3.5-9B (9B) outperforms InternVL3.5-38B (38B) and Gemma3-27B (27B) on cross-depiction tasks. Model family matters more than parameter count for diagram understanding.

Three-Layer Mechanistic Analysis

We probe why models fail at diagram understanding by examining three processing stages: visual encoding, language model reasoning, and attention routing.

CKA analysis showing representational gap
Layer 1 · Representation

Diagrams and Video Diverge at the ViT Level

CKA similarity between diagram and video representations is moderate at the ViT level (0.43–0.58) and drops further after the visual merger, confirming a representational gap.
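Linear CKA between two sets of paired representations can be computed in a few lines. This is a generic sketch of the metric itself; the exact layers and pooling the analysis extracts features from are not specified here:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X [n, d1] and Y [n, d2]
    extracted from the same n paired inputs (e.g. diagram/video pairs)."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
print(linear_cka(X, X))      # identical representations -> ~1.0
print(linear_cka(X, X @ Q))  # invariant to orthogonal transforms -> ~1.0
```

CKA is invariant to orthogonal transformations and isotropic scaling, which is what makes moderate values (0.43–0.58) evidence of a genuine representational gap rather than a trivial change of basis.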

Cosine similarity shift when text is added
Layer 2 · Hidden States

Text Supplants Diagram Information

When text descriptions are added, diagram influence on the LLM's prediction drops 59% while text influence increases 24%. The model switches from visual to text-mediated reasoning.

Attention routing across decoder layers
Layer 3 · Attention

Attention Shifts Away from Diagrams

When text is added, per-token attention to diagram tokens drops 52% while text tokens absorb the freed attention. The model learns to bypass diagram processing entirely.
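A per-token attention share of this kind can be sketched as follows. Here `attn` stands in for one decoder layer's head-averaged attention weights, and the token-index sets are hypothetical; normalizing by group size keeps modalities with different token counts comparable:

```python
import numpy as np

def per_token_attention(attn, key_idx):
    """Average attention each query pays to a single key in `key_idx`.

    attn:    [n_queries, n_keys] attention weights, each row sums to 1.
    key_idx: key positions belonging to one modality (diagram or text).
    """
    mass = attn[:, key_idx].sum(axis=1)   # total mass routed to the group
    return (mass / len(key_idx)).mean()   # per-token share, averaged over queries

# Toy example: 2 queries, 4 keys (keys 0-1 = diagram tokens, 2-3 = text tokens)
attn = np.array([[0.4, 0.2, 0.3, 0.1],
                 [0.1, 0.1, 0.5, 0.3]])
diagram_share = per_token_attention(attn, [0, 1])  # 0.2
text_share    = per_token_attention(attn, [2, 3])  # 0.3
```

Comparing this statistic between the Visual and Visual+Text settings yields the reported shift: diagram-token share falls while text-token share rises.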

Bottleneck analysis: visual encoder is the weak link
Layer 1 · Bottleneck

The Visual Encoder Is the Bottleneck

Cross-modal retrieval (diagram→video) fails even with strong within-modality performance, pinpointing the ViT's inability to create aligned cross-depiction representations.
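A retrieval probe of this kind reduces to nearest-neighbor search under cosine similarity. The sketch below uses random placeholder embeddings in place of ViT features; index i of the query set is assumed to pair with index i of the gallery:

```python
import numpy as np

def recall_at_1(queries, gallery):
    """Fraction of queries whose nearest gallery item (by cosine
    similarity) is the ground-truth pair at the same index."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    pred = (q @ g.T).argmax(axis=1)       # nearest neighbor per query
    return (pred == np.arange(len(q))).mean()

rng = np.random.default_rng(0)
video = rng.standard_normal((50, 128))
# Within-modality sanity check: a lightly perturbed copy retrieves its pair
noisy = video + 0.1 * rng.standard_normal((50, 128))
print(recall_at_1(noisy, video))  # near 1.0
```

The failure mode described above is the cross-depiction case: diagram embeddings querying video embeddings land near chance (1/gallery size) even when this within-modality check succeeds.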

Run in 3 Commands

The entire dataset is on HuggingFace — no manual setup needed.

# 1. Clone and install
git clone https://github.com/Ryenhails/IKEA-Bench.git
cd IKEA-Bench && pip install -r requirements.txt

# 2. Download data (~300MB from HuggingFace)
python setup_data.py

# 3. Evaluate your model
python -m ikea_bench.eval \
    --model qwen3-vl-8b \
    --setting baseline \
    --input data/qa_benchmark.json \
    --data-dir data \
    --output results/qwen3-vl-8b_baseline.json

BibTeX

If you use IKEA-Bench in your research, please cite our paper.

@article{liu2026ikeabench,
  title={Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2604.00913},
  year={2026}
}