Official Project Page

WebCompass

Towards Holistic Evaluation of Web Coding for Multimodal Code Models

Benchmarking Web Coding Agents Across Multimodal Inputs and Full Development Lifecycle

WebCompass unifies text-, image-, and video-grounded web coding tasks across generation, editing, and repair, with task-aware evaluation for execution, interactivity, and aesthetics.

Xinping Lei(†), Xinyu Che(†), Junqi Xiong(†), Chenchen Zhang(†), Yukai Huang(†), Chenyu Zhou(†), Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu(*)

Nanjing University · Kuaishou Technology · (†) Equal contribution · (*) Corresponding author

Core Design

Multimodal Task Matrix

3 Modalities × 3 Task Types, covering 7 valid task categories (video editing and video repair are not included).

|       | Generation               | Editing               | Repair                   |
| Text  | Text-Guided Generation   | Text-Guided Editing   | Diagnostic Repair        |
| Image | Vision-Guided Generation | Vision-Guided Editing | Visual-Diagnostic Repair |
| Video | Video-Guided Generation  | -                     | -                        |

1526 Tasks · 7 Task Categories · 3 Modalities · 3 Task Types

Main Results

Model Comparison Across Task Types and Dimensions

The main benchmark table is presented first for quick model comparison.


Closed-Source Large Language Models

| Model | Generation (RUN / SPI / DSQ) | Editing (ITG / FTI / STC) | Repair (RCT / ITI / RFF) | Overall |
| Claude-Opus-4.5 | 77.18 / 68.95 / 62.26 | 71.86 / 65.82 / 60.83 | 48.45 / 85.54 / 65.71 | 67.40 |
| Gemini-3-Pro-Preview | 74.05 / 55.76 / 64.07 | 69.52 / 65.14 / 58.16 | 54.16 / 87.30 / 72.00 | 66.68 |
| Gemini-3-Flash-Preview | 74.87 / 54.32 / 62.42 | 65.95 / 62.35 / 57.21 | 53.18 / 86.84 / 71.65 | 65.42 |
| GPT-5.2 | 75.38 / 60.22 / 55.92 | 66.97 / 62.70 / 56.63 | 41.24 / 79.33 / 58.70 | 61.90 |
| Claude-Sonnet-4.5 | 65.30 / 50.37 / 56.78 | 60.06 / 53.71 / 45.51 | 40.44 / 80.63 / 61.31 | 57.12 |

Qwen3-VL Series Open-Source Large Language Models

| Model | Generation (RUN / SPI / DSQ) | Editing (ITG / FTI / STC) | Repair (RCT / ITI / RFF) | Overall |
| 235B-A22B-Instruct | 61.26 / 42.14 / 47.06 | 27.74 / 25.48 / 23.53 | 27.30 / 68.87 / 46.88 | 41.14 |
| 235B-A22B-Thinking | 63.86 / 35.02 / 45.21 | 22.15 / 21.67 / 19.06 | 27.02 / 68.74 / 46.28 | 38.78 |
| 32B-Instruct | 50.39 / 25.62 / 34.56 | 26.96 / 26.62 / 22.78 | 24.67 / 61.93 / 43.27 | 35.20 |
| 30B-A3B-Thinking | 47.37 / 20.87 / 37.47 | 19.82 / 21.21 / 18.20 | 18.08 / 51.85 / 31.31 | 29.58 |
| 30B-A3B-Instruct | 41.79 / 20.80 / 29.28 | 20.57 / 20.97 / 17.93 | 19.32 / 50.71 / 31.35 | 28.08 |
Abstract / Overview

Why WebCompass

Evaluating web coding requires more than code correctness: success depends on runtime execution, interaction behavior, and visual quality in browser environments. WebCompass addresses this gap with a unified multimodal benchmark spanning text, image, and video inputs, and lifecycle tasks across generation, editing, and repair. The benchmark is designed for realistic front-end engineering scenarios with deterministic construction and evidence-grounded evaluation.


Core Contributions

Unified lifecycle coverage across generation, editing, and repair with text/image/video inputs.

Rigorous and deterministic task construction with reverse verifiable repair annotations.

Task-aware evaluation: Agent-as-a-Judge for generation, checklist-guided LLM-as-a-Judge for editing/repair.

Three shared evaluation dimensions: Execution, Interactivity, and Aesthetics.

Realistic web engineering scenarios emphasizing multi-page behavior and interaction fidelity.

Figure 1

Overview of WebCompass

WebCompass supports three modalities and three task types, forming seven task categories across the web development lifecycle.

Takeaway: A unified benchmark view connects modalities, tasks, and evaluation dimensions.

Benchmark Design

Unified Multimodal Benchmark Across the Development Lifecycle

WebCompass integrates modalities, task types, and realistic engineering constraints into one coherent benchmark design.


Figure 2

Seven-task Performance Radar

Radar chart of model performance across seven WebCompass task categories.

3 Modalities
3 Task Types
7 Task Categories
Text
Generation

Text-Guided Generation

Input: textual specification covering content, interactions, and visual appearance. Output: a complete runnable web repository.

Image
Generation

Vision-Guided Generation

Input: screenshots (main/subpages and dynamic keyframes). Output: a repository matching visual style and interaction behavior.

Video
Generation

Video-Guided Generation

Input: interaction recording video. Output: a repository consistent with demonstrated dynamic behavior and appearance.

Text
Editing

Text-Guided Editing

Input: source repository + text instruction. Output: code patch that satisfies requirement updates without leaking implementation hints.

Image
Editing

Vision-Guided Editing

Input: source repository + screenshot + instruction. Output: code patch aligned with visual target and requested edits.

Text
Repair

Diagnostic Repair

Input: source repository + issue description. Output: repair patch that resolves defects under deterministic inverse verification.

Image
Repair

Visual-Diagnostic Repair

Input: source repository + screenshot + issue description. Output: patch that repairs both visible and underlying diagnostic issues.

Unified Lifecycle Coverage

WebCompass jointly evaluates generation, editing, and repair instead of isolating a single stage.

Multimodal Inputs

Tasks are grounded in text, image, and video inputs, matching real request channels in web engineering.

Behavior-aware Evaluation

Evaluation explicitly scores Execution, Interactivity, and Aesthetics with evidence-grounded judging protocols.

Benchmark Scale

1526 tasks

Difficulty Distribution

Easy / Medium / Hard

Generation Taxonomy

15 application domains

Evaluation Dimensions

Execution / Interactivity / Aesthetics
Task counts: Text-Guided Generation (123), Vision-Guided Generation (109), Video-Guided Generation (94), Text-Guided Editing (300), Vision-Guided Editing (300), Diagnostic Repair (300), Visual-Diagnostic Repair (300).
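As a quick arithmetic check, the per-category counts listed above sum to the benchmark's 1526 tasks:

```python
# Per-category task counts as stated on this page.
task_counts = {
    "Text-Guided Generation": 123,
    "Vision-Guided Generation": 109,
    "Video-Guided Generation": 94,
    "Text-Guided Editing": 300,
    "Vision-Guided Editing": 300,
    "Diagnostic Repair": 300,
    "Visual-Diagnostic Repair": 300,
}
total = sum(task_counts.values())
print(total)  # 1526
```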
Benchmark Comparison

Coverage Against Prior Benchmarks

Values are aligned to Table 1 in the paper.

| Benchmark | Size | Edit (#) | Repair (#) | Text | Image | Video | Generation | Editing | Repair | Multi-Page | Interaction | Visual | Agentic Eval | Reverse Deterministic Repair |
| Interaction2Code | 504 | - | - | Yes | Yes | - | Yes | - | - | - | Yes | Yes | - | - |
| FronTalk | 1000 | - | - | Yes | Yes | - | Yes | - | - | Yes | Yes | Yes | - | - |
| Web-Bench | 1000 | - | - | Yes | Yes | - | Yes | - | - | Yes | Yes | - | - | - |
| FrontendBench | 148 | - | - | Yes | - | - | Yes | - | - | - | Yes | - | - | - |
| WebApp1K | 1000 | - | - | Yes | - | - | Yes | - | - | Yes | - | Yes | - | - |
| IWR-Bench | 113 | - | - | - | - | Yes | Yes | - | - | Yes | Yes | Yes | - | - |
| WebGen-Bench | 101 | - | - | Yes | - | - | Yes | - | - | Yes | Yes | - | - | - |
| SWE-bench MM | 517 | 3 | 4 | Yes | Yes | - | - | Yes | Yes | Yes | - | - | - | - |
| DesignBench | 900 | 6 | 6 | Yes | Yes | - | Yes | Yes | Yes | Yes | - | Yes | - | - |
| WebCompass | 1526 | 16 | 11 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Method / Pipeline

From Data Construction to Task-aware Evaluation

WebCompass evaluates different task families with tailored judging paradigms while preserving shared dimensions.


Data Construction

Human-in-the-loop collection, augmentation, and quality control across text/image/video sources.

Task Instantiation

Seven task categories spanning generation, editing, and repair with deterministic settings.

Task-aware Evaluation

Checklist-guided LLM-as-a-Judge for editing/repair and Agent-as-a-Judge for generation.
Figure 4

Data Construction Pipeline

Pipeline from prototype collection to deterministic task construction and quality control.

Figure 5

LLM-as-a-Judge for Editing and Repair

Checklist-guided judging pipeline for editing and repair tasks.

Figure 6

Agent-as-a-Judge for Generation

Browser-grounded interaction and evidence collection for open-ended generation.

Data Construction Pipeline

Step 1

Text-Guided Generation Collection and Query Refinement

Collect queries from WebGen-Bench, ArtifactsBench, BigCode Arena, and V0, then refine underspecified requests into structured design documents.

Dedup + Clustering · Difficulty Labeling · 123 Queries
Step 2

Vision-Guided Generation Augmentation

Augment screenshots with subpage captures, keyframes, and multi-page relation markers to better represent dynamic and project-level scenarios.

Screenshot Augmentation · Keyframes · Multi-page
Step 3

Video-Guided Generation Recording

Record interaction-rich browsing trajectories from selected V0/Figma webpages to preserve temporal behavior evidence.

Interaction Trajectories · Temporal Signals
Step 4

Shared Prototype Pool for Editing and Repair

Build prototypes with length filtering, automatic quality scoring, human curation, and single-/multi-page expansion.

32k-64k Length Band · GPT-4o Quality Filter · 50 Prototypes
Step 5

Deterministic Reverse Repair Construction

Inject 11 defect categories and attach exact inverse search/replace annotations to guarantee deterministic, verifiable repair targets.

Defect Injection · Reverse Search/Replace · 11 Defect Types
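The reverse search/replace idea above can be sketched in a few lines. This is a minimal illustration of deterministic defect injection; the function and annotation field names are assumptions, not the benchmark's actual tooling:

```python
# Sketch: inject a defect by one exact replacement and record the inverse edit,
# so the expected repair is fully determined and checkable by string equality.

def inject_defect(source, correct, buggy):
    """Swap one exact snippet for a buggy variant and record the inverse edit."""
    assert source.count(correct) == 1, "anchor must be unique for determinism"
    annotation = {"search": buggy, "replace": correct}  # the inverse edit
    return source.replace(correct, buggy), annotation

original = "button.addEventListener('click', onSubmit);"
buggy_repo, ann = inject_defect(original, "'click'", "'dblclick'")

# Verification: applying the inverse edit to the buggy code restores the original.
expected_fix = buggy_repo.replace(ann["search"], ann["replace"])
print(expected_fix == original)  # True
```

Because the injected snippet is unique and the inverse edit is exact, the repair target can be verified deterministically rather than judged heuristically.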
Step 6

Three-layer Quality Control

Run automated validation, LLM-assisted screening, and final expert review for executability, instruction quality, and annotation consistency.

Automated Checks · LLM Screening · Human Curation

Evaluation Paradigms

LLM-as-a-Judge for Editing and Repair

Judge receives requirements, source repository, predicted patch, runtime logs, and before/after screenshots, then scores checklist items in structured JSON.
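A minimal sketch of how a structured-JSON checklist verdict could be aggregated into the three shared dimensions. The schema shown here is an illustrative assumption, not the benchmark's actual judge format:

```python
import json

# Hypothetical judge output; the real schema is not specified on this page.
verdict_json = """
{"checklist": [
  {"item": "Submit button triggers validation", "dimension": "Interactivity", "pass": true},
  {"item": "Page renders without console errors", "dimension": "Execution", "pass": true},
  {"item": "Updated card layout matches target", "dimension": "Aesthetics", "pass": false}
]}
"""

def dimension_scores(verdict):
    """Average checklist passes per shared evaluation dimension (0-100)."""
    scores = {}
    for dim in ("Execution", "Interactivity", "Aesthetics"):
        items = [c["pass"] for c in verdict["checklist"] if c["dimension"] == dim]
        scores[dim] = 100.0 * sum(items) / len(items) if items else None
    return scores

scores = dimension_scores(json.loads(verdict_json))
print(scores)  # {'Execution': 100.0, 'Interactivity': 100.0, 'Aesthetics': 0.0}
```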

Agent-as-a-Judge for Generation

Agent evaluates in real browser via checklist generation, interaction execution, adaptive verification, and evidence-grounded scoring.

Execution / Interactivity / Aesthetics

Both paradigms use the same three dimensions, with safeguards including checklist immutability, selector-only adaptation, and mandatory evidence grounding.
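Mandatory evidence grounding can be pictured as a simple filter: any "pass" verdict that cites no evidence artifact is downgraded to a failure. The field names below are illustrative assumptions, not the benchmark's real schema:

```python
# Sketch: a "pass" must cite at least one evidence artifact
# (screenshot, console log, DOM snapshot) or it does not count.

def ground(verdicts):
    return [{**v, "pass": bool(v["pass"] and v.get("evidence"))} for v in verdicts]

raw = [
    {"item": "modal opens on click", "pass": True, "evidence": ["after_click.png"]},
    {"item": "filter updates list", "pass": True, "evidence": []},  # ungrounded pass
]
grounded = ground(raw)
print([v["pass"] for v in grounded])  # [True, False]
```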

Experimental Results

Figure-driven Results on Web Coding Agent Evaluation

Representative result figures from the experiments section.


Figure 7

Difficulty Scaling in Generation

Per-dimension generation performance across Easy, Medium, and Hard partitions; performance drops monotonically from Easy to Hard across generation, editing, and repair families.

Takeaway: Interactivity drops the fastest as generation tasks become harder.

Figure 10

Consistency Under Worst-of-N

Worst-of-N analysis shows that stable behavior matters more than isolated high-scoring attempts, and that Interactivity remains the most fragile dimension under complex generation requirements.

Takeaway: Consistency is a stronger reliability signal than one-off wins.

Figure 8

Editing Subtask Breakdown

Performance across 16 editing operation types, with a clear difficulty skew toward animation-heavy edits.

Takeaway: Animation-related edits remain significantly harder than structural edits.

Figure 9

Repair Subtask Breakdown

Repair performance across defect categories, with semantic defects as persistent bottlenecks.

Takeaway: Repair quality depends on deeper intent understanding, not only syntax correction.

Additional Figures

Supplementary Figures Referenced in the Paper

Secondary figures not emphasized above are collected here for completeness.


Figure 11

Agent Ranking Alignment

Comparison between agent-based ranking and human ranking over generation outputs.

Figure 12

Framework-wise Comparison

Result comparison across framework subsets.

Figure 13

Difficulty Scaling in Editing

Per-dimension editing performance over increasing difficulty.

Figure 14

Difficulty Scaling in Repair

Per-dimension repair performance over increasing difficulty.

Figure 15

Patch Complexity Distribution

Patch size and complexity distributions across evaluated models.

Figure 16

Generation Error Distribution

Overall generation error distribution in evaluated model outputs.

Figure 17

Generation Errors by Input Modality

Error distribution split by text, image, and video conditioned generation.

Figure 18

Editing Error Distribution

Category-level error distribution in editing tasks.

Figure 19

Repair Error Distribution

Category-level error distribution in repair tasks.

Insights

Key Takeaways for Web Coding Agent Development

Interpretation-focused insights distilled from result trends and error analyses.

Animation-heavy Editing is Hardest

Parallax scrolling, page transitions, and particle effects are consistently harder than business-scenario operations.

Semantic Defects are Hardest in Repair

Semantic Error is the lowest-scoring repair category, indicating difficulty in intent-level understanding beyond local patching.

Consistency Matters More Than Isolated Wins

Harmonic-mean aggregation penalizes unstable low outliers, so robust cross-subtask consistency is more valuable than occasional peaks.
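The effect is easy to see with Python's statistics module: two score profiles with identical arithmetic means diverge sharply under the harmonic mean when one contains a low outlier:

```python
from statistics import harmonic_mean, mean

# Two score profiles with the same arithmetic mean; the unstable one
# has a low outlier that harmonic-mean aggregation punishes heavily.
stable = [60, 60, 60]
unstable = [85, 85, 10]

print(mean(stable) == mean(unstable))  # True (both 60)
print(round(harmonic_mean(unstable), 1))  # 24.3
```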

Limitations

Scope, Constraints, and Future Directions

Limitations are presented to clarify scope and inform follow-up benchmark design.


Front-end Focus

WebCompass currently targets front-end development (HTML/CSS/JavaScript and front-end frameworks) and does not yet cover back-end or deployment workflows.

Structured Queries vs. Creative Intent

Structured design documents improve determinism and reproducibility, but they emphasize instruction-following more than open-ended creative divergence.

Limited Real-time Evaluation for Highly Dynamic Pages

Time-sensitive behaviors in rapidly changing pages (e.g., games and highly dynamic state transitions) remain challenging for current automated protocols.

Static Benchmark and Contamination Risk

As a static benchmark, long-term contamination risk remains possible and may require periodic benchmark refresh or dynamic task generation.

Evaluation Cost

Agent-as-a-Judge involves browser execution, interaction loops, and iterative test synthesis, making evaluation more computationally expensive.

Conclusion

Toward More Faithful Evaluation of Web Coding Agents

WebCompass emphasizes realistic, multimodal, and lifecycle-aware evaluation for future research.

WebCompass unifies multimodal inputs and lifecycle tasks under evidence-grounded evaluation, positioning web coding agents as holistic builders of user-facing experiences rather than code-only generators.

Citation

Use and Extend WebCompass

If WebCompass is useful for your research, please cite the paper and explore project resources.

The BibTeX entry below uses placeholder metadata; it will be replaced with the final camera-ready bibliography and links.

BibTeX

@article{webcompass2026,
  title   = {WebCompass: A Unified Multimodal Benchmark and Evaluation Framework for Web Coding},
  author  = {Author A and Author B and Author C},
  journal = {arXiv preprint arXiv:TODO},
  year    = {2026},
  url     = {https://arxiv.org/abs/TODO}
}