Official Project Page

WebCompass

Towards Holistic Evaluation of Web Coding for Multimodal Code Models

Benchmarking Web Coding Agents Across Multimodal Inputs and Full Development Lifecycle

WebCompass unifies text-, image-, and video-grounded web coding tasks across generation, editing, and repair, with task-aware evaluation for execution, interactivity, and aesthetics.

Xinping Lei(†), Xinyu Che(†), Junqi Xiong(†), Chenchen Zhang(†), Yukai Huang(†), Chenyu Zhou(†), Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu(*)

Nanjing University · Kuaishou Technology · (†) Equal contribution · (*) Corresponding author

Contact: liujiaheng@nju.edu.cn; l1874493887@gmail.com

Paper GitHub Hugging Face Citation arXiv Badge (Placeholder)

Core Design

Multimodal Task Matrix

3 Modalities x 3 Task Types, covering 7 valid task categories.

Generation

Editing

Repair

Text

Text-Guided Generation

Text-Guided Editing

Diagnostic Repair

Image

Vision-Guided Generation

Vision-Guided Editing

Visual-Diagnostic Repair

Video

Video-Guided Generation

Text

Generation

Text-Guided Generation

Editing

Text-Guided Editing

Repair

Diagnostic Repair

Image

Generation

Vision-Guided Generation

Editing

Vision-Guided Editing

Repair

Visual-Diagnostic Repair

Video

Generation

Video-Guided Generation

Editing

Repair

Tasks

Task Categories

Modalities

Task Types

Main Results

Model Comparison Across Task Types and Dimensions

The principal benchmark table is moved forward for faster model comparison. Higher scores are encoded with stronger color intensity.

Aligned to 4_experiments.tex (Main Results, Task-Type Breakdown, Difficulty-Level Analysis, Error Patterns).

Model	Generation			Editing			Repair			Overall
Model	RUN	SPI	DSQ	ITG	FTI	STC	RCT	ITI	RFF	Overall
Closed-Source Large Language Models
Claude-Opus-4.5	77.18	68.95	62.26	71.86	65.82	60.83	48.45	85.54	65.71	67.40
Gemini-3-Pro-Preview	74.05	55.76	64.07	69.52	65.14	58.16	54.16	87.30	72.00	66.68
Gemini-3-Flash-Preview	74.87	54.32	62.42	65.95	62.35	57.21	53.18	86.84	71.65	65.42
GPT-5.2	75.38	60.22	55.92	66.97	62.70	56.63	41.24	79.33	58.70	61.90
Claude-Sonnet-4.5	65.30	50.37	56.78	60.06	53.71	45.51	40.44	80.63	61.31	57.12
Qwen3-VL Series Open-Source Large Language Models
235B-A22B-Instruct	61.26	42.14	47.06	27.74	25.48	23.53	27.30	68.87	46.88	41.14
235B-A22B-Thinking	63.86	35.02	45.21	22.15	21.67	19.06	27.02	68.74	46.28	38.78
32B-Instruct	50.39	25.62	34.56	26.96	26.62	22.78	24.67	61.93	43.27	35.20
30B-A3B-Thinking	47.37	20.87	37.47	19.82	21.21	18.20	18.08	51.85	31.31	29.58
30B-A3B-Instruct	41.79	20.80	29.28	20.57	20.97	17.93	19.32	50.71	31.35	28.08

Closed-Source Large Language Models

Claude-Opus-4.5

Generation

RUN

SPI

DSQ

77.18

68.95

62.26

Editing

ITG

FTI

STC

71.86

65.82

60.83

Repair

RCT

ITI

RFF

48.45

85.54

65.71

Overall

67.40

Gemini-3-Pro-Preview

Generation

RUN

SPI

DSQ

74.05

55.76

64.07

Editing

ITG

FTI

STC

69.52

65.14

58.16

Repair

RCT

ITI

RFF

54.16

87.30

72.00

Overall

66.68

Gemini-3-Flash-Preview

Generation

RUN

SPI

DSQ

74.87

54.32

62.42

Editing

ITG

FTI

STC

65.95

62.35

57.21

Repair

RCT

ITI

RFF

53.18

86.84

71.65

Overall

65.42

GPT-5.2

Generation

RUN

SPI

DSQ

75.38

60.22

55.92

Editing

ITG

FTI

STC

66.97

62.70

56.63

Repair

RCT

ITI

RFF

41.24

79.33

58.70

Overall

61.90

Claude-Sonnet-4.5

Generation

RUN

SPI

DSQ

65.30

50.37

56.78

Editing

ITG

FTI

STC

60.06

53.71

45.51

Repair

RCT

ITI

RFF

40.44

80.63

61.31

Overall

57.12

Qwen3-VL Series Open-Source Large Language Models

235B-A22B-Instruct

Generation

RUN

SPI

DSQ

61.26

42.14

47.06

Editing

ITG

FTI

STC

27.74

25.48

23.53

Repair

RCT

ITI

RFF

27.30

68.87

46.88

Overall

41.14

235B-A22B-Thinking

Generation

RUN

SPI

DSQ

63.86

35.02

45.21

Editing

ITG

FTI

STC

22.15

21.67

19.06

Repair

RCT

ITI

RFF

27.02

68.74

46.28

Overall

38.78

32B-Instruct

Generation

RUN

SPI

DSQ

50.39

25.62

34.56

Editing

ITG

FTI

STC

26.96

26.62

22.78

Repair

RCT

ITI

RFF

24.67

61.93

43.27

Overall

35.20

30B-A3B-Thinking

Generation

RUN

SPI

DSQ

47.37

20.87

37.47

Editing

ITG

FTI

STC

19.82

21.21

18.20

Repair

RCT

ITI

RFF

18.08

51.85

31.31

Overall

29.58

30B-A3B-Instruct

Generation

RUN

SPI

DSQ

41.79

20.80

29.28

Editing

ITG

FTI

STC

20.57

20.97

17.93

Repair

RCT

ITI

RFF

19.32

50.71

31.35

Overall

28.08

Abstract / Overview

Why WebCompass

Evaluating web coding requires more than code correctness: success depends on runtime execution, interaction behavior, and visual quality in browser environments. WebCompass addresses this gap with a unified multimodal benchmark spanning text, image, and video inputs, and lifecycle tasks across generation, editing, and repair. The benchmark is designed for realistic front-end engineering scenarios with deterministic construction and evidence-grounded evaluation.

Aligned to 1_intro.tex (Introduction, Contributions) and 2_artifactsBench.tex (Overview).

Core Contributions

Unified lifecycle coverage across generation, editing, and repair with text/image/video inputs.

Rigorous and deterministic task construction with reverse verifiable repair annotations.

Task-aware evaluation: Agent-as-a-Judge for generation, checklist-guided LLM-as-a-Judge for editing/repair.

Three shared evaluation dimensions: Execution, Interactivity, and Aesthetics.

Realistic web engineering scenarios emphasizing multi-page behavior and interaction fidelity.

Figure 1

Overview of WebCompass

WebCompass supports three modalities and three task types, forming seven task categories across the web development lifecycle.

Takeaway: A unified benchmark view connects modalities, tasks, and evaluation dimensions.

Benchmark Design

Unified Multimodal Benchmark Across the Development Lifecycle

WebCompass integrates modalities, task types, and realistic engineering constraints into one coherent benchmark design.

Aligned to 2_artifactsBench.tex (Overview, Dataset Statistics, Task Type Descriptions) and Introduction Table 1.

Figure 2

Seven-task Performance Radar

Radar chart of model performance across seven WebCompass task categories.

3 Modalities

3 Task Types

7 Task Categories

Text

Generation

Text-Guided Generation

Input: textual specification covering content, interactions, and visual appearance. Output: a complete runnable web repository.

Image

Generation

Vision-Guided Generation

Input: screenshots (main/subpages and dynamic keyframes). Output: a repository matching visual style and interaction behavior.

Video

Generation

Video-Guided Generation

Input: interaction recording video. Output: a repository consistent with demonstrated dynamic behavior and appearance.

Text

Editing

Text-Guided Editing

Input: source repository + text instruction. Output: code patch that satisfies requirement updates without leaking implementation hints.

Image

Editing

Vision-Guided Editing

Input: source repository + screenshot + instruction. Output: code patch aligned with visual target and requested edits.

Diagnostic

Repair

Diagnostic Repair

Input: source repository + issue description. Output: repair patch that resolves defects under deterministic inverse verification.

Image

Repair

Visual-Diagnostic Repair

Input: source repository + screenshot + issue description. Output: patch that repairs both visible and underlying diagnostic issues.

Unified Lifecycle Coverage

WebCompass jointly evaluates generation, editing, and repair instead of isolating a single stage.

Multimodal Inputs

Tasks are grounded in text, image, and video inputs, matching real request channels in web engineering.

Behavior-aware Evaluation

Evaluation explicitly scores Execution, Interactivity, and Aesthetics with evidence-grounded judging protocols.

Benchmark Scale

1526 tasks

Difficulty Distribution

Easy / Medium / Hard

Generation Taxonomy

15 application domains

Evaluation Dimensions

Execution / Interactivity / Aesthetics

Task counts: Text-Guided Generation (123), Vision-Guided Generation (109), Video-Guided Generation (94), Text-Guided Editing (300), Vision-Guided Editing (300), Diagnostic Repair (300), Visual-Diagnostic Repair (300).

Benchmark Comparison

Coverage Against Prior Benchmarks

Values are aligned to Table 1 in the paper.

Benchmark	Size	Edit (#)	Repair (#)
Interaction2Code	504	-	-
FronTalk	1000	-	-
Web-Bench	1000	-	-
FrontendBench	148	-	-
WebApp1K	1000	-	-
IWR-Bench	113	-	-
WebGen-Bench	101	-	-
SWE-bench MM	517	3	4
DesignBench	900	6	6
WebCompass	1526	16	11

Method / Pipeline

From Data Construction to Task-aware Evaluation

WebCompass evaluates different task families with tailored judging paradigms while preserving shared dimensions.

Aligned to 2_artifactsBench.tex (Data Collection, Quality Control) and 3_eval.tex (Evaluation Methodology).

Data Construction

Human-in-the-loop collection, augmentation, and quality control across text/image/video sources.

Task Instantiation

Seven task categories spanning generation, editing, and repair with deterministic settings.

Task-aware Evaluation

Checklist-guided LLM-as-a-Judge for editing/repair and Agent-as-a-Judge for generation.

Figure 4

Data Construction Pipeline

Pipeline from prototype collection to deterministic task construction and quality control.

Figure 5

LLM-as-a-Judge for Editing and Repair

Checklist-guided judging pipeline for editing and repair tasks.

Figure 6

Agent-as-a-Judge for Generation

Browser-grounded interaction and evidence collection for open-ended generation.

Data Construction Pipeline

Step 1

Text-Guided Generation Collection and Query Refinement

Collect queries from WebGen-Bench, ArtifactsBench, BigCode Arena, and V0, then refine underspecified requests into structured design documents.

Dedup + ClusteringDifficulty Labeling123 Queries

Step 2

Vision-Guided Generation Augmentation

Augment screenshots with subpage captures, keyframes, and multi-page relation markers to better represent dynamic and project-level scenarios.

Screenshot AugmentationKeyframesMulti-page

Step 3

Video-Guided Generation Recording

Record interaction-rich browsing trajectories from selected V0/Figma webpages to preserve temporal behavior evidence.

Interaction TrajectoriesTemporal Signals

Step 4

Shared Prototype Pool for Editing and Repair

Build prototypes with length filtering, automatic quality scoring, human curation, and single-/multi-page expansion.

32k-64k Length BandGPT-4o Quality Filter50 Prototypes

Step 5

Deterministic Reverse Repair Construction

Inject 11 defect categories and attach exact inverse search/replace annotations to guarantee deterministic, verifiable repair targets.

Defect InjectionReverse Search/Replace11 Defect Types

Step 6

Three-layer Quality Control

Run automated validation, LLM-assisted screening, and final expert review for executability, instruction quality, and annotation consistency.

Automated ChecksLLM ScreeningHuman Curation

Evaluation Paradigms

LLM-as-a-Judge for Editing and Repair

Judge receives requirements, source repository, predicted patch, runtime logs, and before/after screenshots, then scores checklist items in structured JSON.

Agent-as-a-Judge for Generation

Agent evaluates in real browser via checklist generation, interaction execution, adaptive verification, and evidence-grounded scoring.

Execution / Interactivity / Aesthetics

Both paradigms use the same three dimensions, with safeguards including checklist immutability, selector-only adaptation, and mandatory evidence grounding.

Experimental Results

Figure-driven Results on Web Coding Agent Evaluation

Representative result figures from the experiments section.

Aligned to 4_experiments.tex (Main Results, Task-Type Breakdown, Difficulty-Level Analysis, Error Patterns).

Figure 7

Judge Ranking Alignment

Automatic agent-based ranking is closely aligned with human ranking across major model families.

Figure 8

Editing Subtask Breakdown

Animation-heavy operations remain significantly harder than structure-preserving edits. The gap is most visible in tasks requiring coordinated motion timing, transition continuity, and multi-element synchronization, where otherwise strong models still show unstable behavior.

Figure 9

Repair Subtask Breakdown

Semantic defects demand stronger intent understanding than surface-level bug fixing. Compared with syntactic and style-level repairs, semantic corrections require models to infer page-level goals and preserve interaction logic across components, which remains a persistent failure mode.

Figure 10

Consistency Under Worst-of-N

Worst-of-N analysis shows that consistency is a stronger reliability signal than isolated wins. Even when a model can occasionally produce high-scoring outputs, robustness drops when evaluated over repeated runs, indicating that stable performance under heterogeneous constraints is still challenging.

Additional Figures

Supplementary Figures Referenced in the Paper

Secondary figures not emphasized above are collected here for completeness.

Figures sourced from Paper/figures with TODO markers where original assets are PDF-only.

Figure 11

Task-level Performance by Difficulty

Performance comparison across Generation, Editing, and Repair tasks under Easy, Medium, and Hard settings.

Figure 12

Generation Dimensions by Difficulty

Generation dimensions (Runnability, Spec Implementation, Design Quality) across difficulty tiers.

Figure 13

Editing Dimensions by Difficulty

Editing dimensions (Instruction Targeting, Feature Integrity, Style Conformance) across difficulty tiers.

Figure 14

Repair Dimensions by Difficulty

Repair dimensions (Root-Cause Targeting, Interaction Integrity, Reference Fidelity) across difficulty tiers.

Figure 15

Framework-wise Comparison

Result comparison across framework subsets.

Figure 16

Patch Complexity Distribution

Patch size and complexity distributions across evaluated models.

Figure 17

Generation Error Distribution

Overall generation error distribution in evaluated model outputs.

Figure 18

Generation Errors by Input Modality

Error distribution split by text, image, and video conditioned generation.

Figure 19

Editing Error Distribution

Category-level error distribution in editing tasks.

Figure 20

Repair Error Distribution

Category-level error distribution in repair tasks.

Insights

Key Takeaways for Web Coding Agent Development

Interpretation-focused insights distilled from result trends and error analyses.

Animation-heavy Editing is Hardest

Parallax scrolling, page transitions, and particle effects are consistently harder than business-scenario operations.

Semantic Defects are Hardest in Repair

Semantic Error is the lowest-scoring repair category, indicating difficulty in intent-level understanding beyond local patching.

Consistency Matters More Than Isolated Wins

Harmonic-mean aggregation penalizes unstable low outliers, so robust cross-subtask consistency is more valuable than occasional peaks.

Limitations

Scope, Constraints, and Future Directions

Limitations are presented to clarify scope and inform follow-up benchmark design.

Aligned to 7_limit.tex (Limitations) and 6_conclusion.tex (Conclusion).

Front-end Focus

WebCompass currently targets front-end development (HTML/CSS/JavaScript and front-end frameworks) and does not yet cover back-end or deployment workflows.

Structured Queries vs. Creative Intent

Structured design documents improve determinism and reproducibility, but they emphasize instruction-following more than open-ended creative divergence.

Limited Real-time Evaluation for Highly Dynamic Pages

Time-sensitive behaviors in rapidly changing pages (e.g., games and highly dynamic state transitions) remain challenging for current automated protocols.

Static Benchmark and Contamination Risk

As a static benchmark, long-term contamination risk remains possible and may require periodic benchmark refresh or dynamic task generation.

Evaluation Cost

Agent-as-a-Judge involves browser execution, interaction loops, and iterative test synthesis, making evaluation more computationally expensive.

Conclusion

Toward More Faithful Evaluation of Web Coding Agents

WebCompass emphasizes realistic, multimodal, and lifecycle-aware evaluation for future research.

WebCompass unifies multimodal inputs and lifecycle tasks under evidence-grounded evaluation, positioning web coding agents as holistic builders of user-facing experiences rather than code-only generators.

Citation

Use and Extend WebCompass

If WebCompass is useful for your research, please cite the paper and explore project resources.

Aligned to paper metadata placeholders; replace with final camera-ready bibliography and links.

BibTeX

@article{webcompass2026,
  title   = {WebCompass: A Unified Multimodal Benchmark and Evaluation Framework for Web Coding},
  author  = {Author A and Author B and Author C},
  journal = {arXiv preprint arXiv:TODO},
  year    = {2026},
  url     = {https://arxiv.org/abs/TODO}
}

Paper GitHub Hugging Face Dataset