Official Project Page
WebCompass
Towards Holistic Evaluation of Web Coding for Multimodal Code Models
Benchmarking Web Coding Agents Across Multimodal Inputs and Full Development Lifecycle
WebCompass unifies text-, image-, and video-grounded web coding tasks across generation, editing, and repair, with task-aware evaluation for execution, interactivity, and aesthetics.
Xinping Lei(†), Xinyu Che(†), Junqi Xiong(†), Chenchen Zhang(†), Yukai Huang(†), Chenyu Zhou(†), Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu(*)
Nanjing University · Kuaishou Technology · (†) Equal contribution · (*) Corresponding author
Core Design
Multimodal Task Matrix
3 Modalities × 3 Task Types, yielding 7 valid task categories.
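The 3 × 3 matrix above can be enumerated directly: the page lists generation tasks for all three modalities, but editing and repair only in text- and vision-guided variants, so the two video-conditioned cells are the invalid ones. A minimal sketch:

```python
# Enumerate the 3x3 modality-by-task matrix described above.
# The two video cells for editing and repair are the combinations
# not listed among WebCompass's seven task categories.
MODALITIES = ["text", "image", "video"]
TASK_TYPES = ["generation", "editing", "repair"]
INVALID = {("video", "editing"), ("video", "repair")}

valid = [(m, t) for m in MODALITIES for t in TASK_TYPES
         if (m, t) not in INVALID]

for modality, task in valid:
    print(f"{modality}-guided {task}")
```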
1526
Tasks
7
Task Categories
3
Modalities
3
Task Types
Model Comparison Across Task Types and Dimensions
The main benchmark table is presented first for quick model comparison. Higher scores are shaded with stronger color intensity.
Aligned to 4_experiments.tex (Main Results, Task-Type Breakdown, Difficulty-Level Analysis, Error Patterns).
RUN, SPI, and DSQ are Generation metrics; ITG, FTI, STC, and RCT are Editing metrics; ITI and RFF are Repair metrics.

| Model | RUN | SPI | DSQ | ITG | FTI | STC | RCT | ITI | RFF | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| **Closed-Source Large Language Models** | | | | | | | | | | |
| Claude-Opus-4.5 | 77.18 | 68.95 | 62.26 | 71.86 | 65.82 | 60.83 | 48.45 | 85.54 | 65.71 | 67.40 |
| Gemini-3-Pro-Preview | 74.05 | 55.76 | 64.07 | 69.52 | 65.14 | 58.16 | 54.16 | 87.30 | 72.00 | 66.68 |
| Gemini-3-Flash-Preview | 74.87 | 54.32 | 62.42 | 65.95 | 62.35 | 57.21 | 53.18 | 86.84 | 71.65 | 65.42 |
| GPT-5.2 | 75.38 | 60.22 | 55.92 | 66.97 | 62.70 | 56.63 | 41.24 | 79.33 | 58.70 | 61.90 |
| Claude-Sonnet-4.5 | 65.30 | 50.37 | 56.78 | 60.06 | 53.71 | 45.51 | 40.44 | 80.63 | 61.31 | 57.12 |
| **Open-Source Large Language Models (Qwen3-VL Series)** | | | | | | | | | | |
| 235B-A22B-Instruct | 61.26 | 42.14 | 47.06 | 27.74 | 25.48 | 23.53 | 27.30 | 68.87 | 46.88 | 41.14 |
| 235B-A22B-Thinking | 63.86 | 35.02 | 45.21 | 22.15 | 21.67 | 19.06 | 27.02 | 68.74 | 46.28 | 38.78 |
| 32B-Instruct | 50.39 | 25.62 | 34.56 | 26.96 | 26.62 | 22.78 | 24.67 | 61.93 | 43.27 | 35.20 |
| 30B-A3B-Thinking | 47.37 | 20.87 | 37.47 | 19.82 | 21.21 | 18.20 | 18.08 | 51.85 | 31.31 | 29.58 |
| 30B-A3B-Instruct | 41.79 | 20.80 | 29.28 | 20.57 | 20.97 | 17.93 | 19.32 | 50.71 | 31.35 | 28.08 |
Why WebCompass
Evaluating web coding requires more than code correctness: success depends on runtime execution, interaction behavior, and visual quality in browser environments. WebCompass addresses this gap with a unified multimodal benchmark spanning text, image, and video inputs, and lifecycle tasks across generation, editing, and repair. The benchmark is designed for realistic front-end engineering scenarios with deterministic construction and evidence-grounded evaluation.
Aligned to 1_intro.tex (Introduction, Contributions) and 2_artifactsBench.tex (Overview).
Core Contributions
Unified lifecycle coverage across generation, editing, and repair with text/image/video inputs.
Rigorous and deterministic task construction with reverse verifiable repair annotations.
Task-aware evaluation: Agent-as-a-Judge for generation, checklist-guided LLM-as-a-Judge for editing/repair.
Three shared evaluation dimensions: Execution, Interactivity, and Aesthetics.
Realistic web engineering scenarios emphasizing multi-page behavior and interaction fidelity.
Overview of WebCompass
WebCompass supports three modalities and three task types, forming seven task categories across the web development lifecycle.
Takeaway: A unified benchmark view connects modalities, tasks, and evaluation dimensions.
Unified Multimodal Benchmark Across the Development Lifecycle
WebCompass integrates modalities, task types, and realistic engineering constraints into one coherent benchmark design.
Aligned to 2_artifactsBench.tex (Overview, Dataset Statistics, Task Type Descriptions) and Introduction Table 1.
Seven-task Performance Radar
Radar chart of model performance across seven WebCompass task categories.
Text-Guided Generation
Input: textual specification covering content, interactions, and visual appearance. Output: a complete runnable web repository.
Vision-Guided Generation
Input: screenshots (main/subpages and dynamic keyframes). Output: a repository matching visual style and interaction behavior.
Video-Guided Generation
Input: interaction recording video. Output: a repository consistent with demonstrated dynamic behavior and appearance.
Text-Guided Editing
Input: source repository + text instruction. Output: code patch that satisfies requirement updates without leaking implementation hints.
Vision-Guided Editing
Input: source repository + screenshot + instruction. Output: code patch aligned with visual target and requested edits.
Diagnostic Repair
Input: source repository + issue description. Output: repair patch that resolves defects under deterministic inverse verification.
Visual-Diagnostic Repair
Input: source repository + screenshot + issue description. Output: patch that repairs both visible and underlying diagnostic issues.
Unified Lifecycle Coverage
WebCompass jointly evaluates generation, editing, and repair instead of isolating a single stage.
Multimodal Inputs
Tasks are grounded in text, image, and video inputs, matching real request channels in web engineering.
Behavior-aware Evaluation
Evaluation explicitly scores Execution, Interactivity, and Aesthetics with evidence-grounded judging protocols.
Benchmark Scale
Difficulty Distribution
Generation Taxonomy
Evaluation Dimensions
Coverage Against Prior Benchmarks
Values are aligned to Table 1 in the paper.
| Benchmark | Size | Edit Types (#) | Repair Types (#) | Text | Image | Video | Generation | Editing | Repair | Multi-Page | Interaction | Visual | Agentic Eval | Reverse Deterministic Repair |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Interaction2Code | 504 | - | - | Yes | Yes | - | Yes | - | - | - | Yes | Yes | - | - |
| FronTalk | 1000 | - | - | Yes | Yes | - | Yes | - | - | Yes | Yes | Yes | - | - |
| Web-Bench | 1000 | - | - | Yes | Yes | - | Yes | - | - | Yes | Yes | - | - | - |
| FrontendBench | 148 | - | - | Yes | - | - | Yes | - | - | - | Yes | - | - | - |
| WebApp1K | 1000 | - | - | Yes | - | - | Yes | - | - | Yes | - | Yes | - | - |
| IWR-Bench | 113 | - | - | - | - | Yes | Yes | - | - | Yes | Yes | Yes | - | - |
| WebGen-Bench | 101 | - | - | Yes | - | - | Yes | - | - | Yes | Yes | - | - | - |
| SWE-bench MM | 517 | 3 | 4 | Yes | Yes | - | - | Yes | Yes | Yes | - | - | - | - |
| DesignBench | 900 | 6 | 6 | Yes | Yes | - | Yes | Yes | Yes | Yes | - | Yes | - | - |
| WebCompass | 1526 | 16 | 11 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
From Data Construction to Task-aware Evaluation
WebCompass evaluates different task families with tailored judging paradigms while preserving shared dimensions.
Aligned to 2_artifactsBench.tex (Data Collection, Quality Control) and 3_eval.tex (Evaluation Methodology).
Data Construction
Task Instantiation
Task-aware Evaluation
Data Construction Pipeline
Pipeline from prototype collection to deterministic task construction and quality control.
LLM-as-a-Judge for Editing and Repair
Checklist-guided judging pipeline for editing and repair tasks.
Agent-as-a-Judge for Generation
Browser-grounded interaction and evidence collection for open-ended generation.
Data Construction Pipeline
Text-Guided Generation Collection and Query Refinement
Collect queries from WebGen-Bench, ArtifactsBench, BigCode Arena, and V0, then refine underspecified requests into structured design documents.
Vision-Guided Generation Augmentation
Augment screenshots with subpage captures, keyframes, and multi-page relation markers to better represent dynamic and project-level scenarios.
Video-Guided Generation Recording
Record interaction-rich browsing trajectories from selected V0/Figma webpages to preserve temporal behavior evidence.
Shared Prototype Pool for Editing and Repair
Build prototypes with length filtering, automatic quality scoring, human curation, and single-/multi-page expansion.
Deterministic Reverse Repair Construction
Inject 11 defect categories and attach exact inverse search/replace annotations to guarantee deterministic, verifiable repair targets.
Three-layer Quality Control
Run automated validation, LLM-assisted screening, and final expert review for executability, instruction quality, and annotation consistency.
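The deterministic reverse-repair construction above pairs every injected defect with its exact inverse edit, so a predicted patch can be verified by string matching rather than by a judge. A minimal sketch, where the field names (`defect_category`, `search`, `replace`) are illustrative assumptions, not the benchmark's actual annotation schema:

```python
# Hedged sketch of deterministic reverse-repair verification.
# Field names are hypothetical, not WebCompass's real schema.
from dataclasses import dataclass

@dataclass
class RepairAnnotation:
    defect_category: str
    search: str    # exact buggy snippet injected into the prototype
    replace: str   # original correct snippet (the inverse edit)

def inject(source: str, ann: RepairAnnotation) -> str:
    """Create the broken task input by swapping correct code for the defect."""
    assert source.count(ann.replace) == 1, "injection site must be unique"
    return source.replace(ann.replace, ann.search)

def is_repaired(patched: str, ann: RepairAnnotation) -> bool:
    """Deterministic check: the buggy snippet is gone, the fix is present."""
    return ann.search not in patched and ann.replace in patched

correct = "button.addEventListener('click', openModal);"
ann = RepairAnnotation(
    defect_category="event-binding",
    search="button.addEventListener('clik', openModal);",
    replace=correct,
)
broken = inject(correct, ann)
```

Because the inverse edit is stored exactly, the repair target is verifiable without any model in the loop, which is what makes the annotation deterministic.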
Evaluation Paradigms
LLM-as-a-Judge for Editing and Repair
Judge receives requirements, source repository, predicted patch, runtime logs, and before/after screenshots, then scores checklist items in structured JSON.
Agent-as-a-Judge for Generation
Agent evaluates in real browser via checklist generation, interaction execution, adaptive verification, and evidence-grounded scoring.
Execution / Interactivity / Aesthetics
Both paradigms use the same three dimensions, with safeguards including checklist immutability, selector-only adaptation, and mandatory evidence grounding.
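The mandatory evidence grounding described above can be sketched as a scoring rule: a checklist item counts only if it cites collected evidence, and ungrounded items score zero. The JSON field names below are assumptions for illustration, not the paper's actual judge schema:

```python
# Illustrative sketch of checklist scoring with mandatory evidence
# grounding; field names are assumptions, not WebCompass's schema.
import json

judge_output = json.loads("""
{
  "items": [
    {"id": "exec-1", "dimension": "Execution",
     "score": 1.0, "evidence": ["runtime_log:no_console_errors"]},
    {"id": "inter-1", "dimension": "Interactivity",
     "score": 0.5, "evidence": ["screenshot_after_click.png"]},
    {"id": "aes-1", "dimension": "Aesthetics",
     "score": 1.0, "evidence": []}
  ]
}
""")

def dimension_scores(items):
    """Average per dimension, zeroing items that cite no evidence."""
    by_dim = {}
    for it in items:
        grounded = it["score"] if it["evidence"] else 0.0
        by_dim.setdefault(it["dimension"], []).append(grounded)
    return {d: sum(s) / len(s) for d, s in by_dim.items()}

print(dimension_scores(judge_output["items"]))
```

Under this rule the ungrounded Aesthetics item contributes 0.0 rather than its claimed 1.0, which is the point of evidence grounding: a judge cannot award credit it cannot substantiate.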
Figure-driven Results on Web Coding Agent Evaluation
Representative result figures from the experiments section.
Aligned to 4_experiments.tex (Main Results, Task-Type Breakdown, Difficulty-Level Analysis, Error Patterns).
Difficulty Scaling in Generation
Per-dimension generation performance across Easy, Medium, and Hard partitions.
Takeaway: Interactivity drops the fastest as generation tasks become harder.
Difficulty Scaling
Performance drops monotonically from Easy to Hard across generation, editing, and repair families.
Consistency Under Worst-of-N
Worst-of-N analysis shows that stable behavior matters more than isolated high-scoring attempts.
Takeaway: Consistency is a stronger reliability signal than one-off wins.
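The Worst-of-N intuition can be sketched with made-up scores (not benchmark numbers): a spiky model can win on the average attempt yet lose on its reliability floor.

```python
# Sketch of Worst-of-N vs mean scoring over repeated attempts.
# Scores are illustrative, not benchmark results.
def worst_of_n(attempts):
    """Reliability floor: the weakest of N independent attempts."""
    return min(attempts)

stable = [62, 60, 61, 63]   # consistent model
spiky  = [85, 45, 80, 40]   # occasional wins, unreliable floor

print(sum(spiky) / 4, sum(stable) / 4)          # spiky has the higher mean
print(worst_of_n(spiky), worst_of_n(stable))    # stable has the higher floor
```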
Interactivity Bottleneck
Interactivity remains the most fragile dimension under complex generation requirements.
Editing Subtask Breakdown
Performance across 16 editing operation types with clear difficulty skew on animation-heavy edits.
Takeaway: Animation-related edits remain significantly harder than structural edits.
Editing Difficulty
Animation-heavy operations remain significantly harder than structure-preserving edits.
Repair Subtask Breakdown
Repair performance across defect categories with semantic defects as persistent bottlenecks.
Takeaway: Repair quality depends on deeper intent understanding, not only syntax correction.
Repair Difficulty
Semantic defects demand stronger intent understanding than surface-level bug fixing.
Supplementary Figures Referenced in the Paper
Secondary figures not emphasized above are collected here for completeness.
Figures sourced from Paper/figures with TODO markers where original assets are PDF-only.
Agent Ranking Alignment
Comparison between agent-based ranking and human ranking over generation outputs.
Framework-wise Comparison
Result comparison across framework subsets.
Difficulty Scaling in Editing
Per-dimension editing performance over increasing difficulty.
Difficulty Scaling in Repair
Per-dimension repair performance over increasing difficulty.
Patch Complexity Distribution
Patch size and complexity distributions across evaluated models.
Generation Error Distribution
Overall generation error distribution in evaluated model outputs.
Generation Errors by Input Modality
Error distribution split by text, image, and video conditioned generation.
Editing Error Distribution
Category-level error distribution in editing tasks.
Repair Error Distribution
Category-level error distribution in repair tasks.
Key Takeaways for Web Coding Agent Development
Interpretation-focused insights distilled from result trends and error analyses.
Animation-heavy Editing is Hardest
Parallax scrolling, page transitions, and particle effects are consistently harder than business-scenario operations.
Semantic Defects are Hardest in Repair
Semantic Error is the lowest-scoring repair category, indicating difficulty in intent-level understanding beyond local patching.
Consistency Matters More Than Isolated Wins
Harmonic-mean aggregation heavily penalizes low outliers, so robust cross-subtask consistency is more valuable than occasional peaks.
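The harmonic-mean penalty can be seen with two illustrative score profiles that share the same arithmetic mean (the numbers are made up, not benchmark results):

```python
# Sketch of how harmonic-mean aggregation punishes a single low
# outlier relative to the arithmetic mean; numbers are illustrative.
from statistics import harmonic_mean

consistent = [70, 70, 70, 70]
one_crash  = [90, 90, 90, 10]   # same arithmetic mean, one failure

print(sum(one_crash) / 4)              # → 70.0 (arithmetic mean hides the crash)
print(harmonic_mean(one_crash))        # → 30.0 (harmonic mean exposes it)
print(harmonic_mean(consistent))       # → 70.0
```

Exactly: 4 / (3/90 + 1/10) = 4 / (4/30) = 30, so one near-zero subtask drags the harmonic aggregate far below the arithmetic mean, rewarding consistency.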
Scope, Constraints, and Future Directions
Limitations are presented to clarify scope and inform follow-up benchmark design.
Aligned to 7_limit.tex (Limitations) and 6_conclusion.tex (Conclusion).
Front-end Focus
WebCompass currently targets front-end development (HTML/CSS/JavaScript and front-end frameworks) and does not yet cover back-end or deployment workflows.
Structured Queries vs. Creative Intent
Structured design documents improve determinism and reproducibility, but they emphasize instruction-following more than open-ended creative divergence.
Limited Real-time Evaluation for Highly Dynamic Pages
Time-sensitive behaviors in rapidly changing pages (e.g., games and highly dynamic state transitions) remain challenging for current automated protocols.
Static Benchmark and Contamination Risk
As a static benchmark, long-term contamination risk remains possible and may require periodic benchmark refresh or dynamic task generation.
Evaluation Cost
Agent-as-a-Judge involves browser execution, interaction loops, and iterative test synthesis, making evaluation more computationally expensive.
Toward More Faithful Evaluation of Web Coding Agents
WebCompass emphasizes realistic, multimodal, and lifecycle-aware evaluation for future research.
WebCompass unifies multimodal inputs and lifecycle tasks under evidence-grounded evaluation, positioning web coding agents as holistic builders of user-facing experiences rather than code-only generators.
Use and Extend WebCompass
If WebCompass is useful for your research, please cite the paper and explore project resources.
Aligned to paper metadata placeholders; replace with final camera-ready bibliography and links.
BibTeX
@article{webcompass2026,
  title   = {WebCompass: A Unified Multimodal Benchmark and Evaluation Framework for Web Coding},
  author  = {Author A and Author B and Author C},
  journal = {arXiv preprint arXiv:TODO},
  year    = {2026},
  url     = {https://arxiv.org/abs/TODO}
}