## Installation

```bash
git clone https://github.com/NJU-LINK/CodeTracer.git
cd CodeTracer
pip install -e .
```

Configure your LLM endpoint:

```bash
export CODETRACER_API_BASE="https://api.openai.com/v1"
export CODETRACER_API_KEY="your-api-key"
```
## Quick Start

```python
from pathlib import Path
from codetracer.query.normalizer import Normalizer
from codetracer.query.tree_builder import TreeBuilder
from codetracer.skills.pool import SkillPool
from codetracer.agents.trace_agent import TraceAgent
from codetracer.agents.context import ContextAssembler
from codetracer.llm.client import LLMClient

# 1. Normalize the trajectory
pool = SkillPool()
normalizer = Normalizer(pool)
skill = normalizer.detect(Path("path/to/trajectory"))
traj = normalizer.normalize(Path("path/to/trajectory"), skill)

# 2. Build the navigation tree
tree_md = TreeBuilder().build(traj)

# 3. Run the diagnosis
llm = LLMClient(api_base="https://api.openai.com/v1", api_key="...", model_name="gpt-4o")
assembler = ContextAssembler(config={}, skill_pool=pool)
agent = TraceAgent(llm, assembler, Path("./work"), Path("./labels.json"), config={})
result = agent.run(skill)
```
## TraceAgent

*class*: High-level trace agent that wires context assembly and the base agent loop for autonomous trajectory diagnosis.

### Constructor

```python
TraceAgent(
    llm: LLMClient,
    assembler: ContextAssembler,
    run_dir: Path,
    output_path: Path,
    config: dict[str, Any],
    artifacts_dir: Path | None = None,
    *,
    hooks: HookManager | None = None,
    cost_tracker: CostTracker | None = None,
    compact_manager: CompactManager | None = None,
    profile: OutputProfile | None = None,
    agent_type: str = ""
)
```

### Methods

- `run(skill, task_ctx=None, memory_text="", budget_context="", traj_metadata=None) → str`: Run the full analysis and return a result summary.
- `run_iter(skill, task_ctx=None, memory_text="", budget_context="", traj_metadata=None)`: Generator variant that yields `AgentEvent` objects for streaming.
- `save_trajectory(path: Path) → None`: Save the agent conversation trajectory to a JSON file.
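The streaming variant can be consumed like any generator. Below is a minimal sketch of the consumption loop; the `AgentEvent` fields and the stand-in `fake_run_iter` are hypothetical simplifications, not CodeTracer's real definitions:

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class AgentEvent:          # hypothetical stand-in for CodeTracer's event type
    kind: str              # e.g. "step" or "final" (assumed field names)
    payload: str

def fake_run_iter() -> Iterator[AgentEvent]:
    # Stand-in for agent.run_iter(skill): yields events; the last carries the summary.
    yield AgentEvent("step", "inspected step 3")
    yield AgentEvent("step", "inspected step 7")
    yield AgentEvent("final", "first incorrect step: 7")

summary = None
for event in fake_run_iter():
    if event.kind == "final":
        summary = event.payload
    else:
        print(event.payload)    # stream intermediate progress as it arrives
```

With a real agent the loop would read `for event in agent.run_iter(skill):` instead.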
## CompactManager

*class*: Two-tier context compaction: LLM summarization with a sliding-window fallback. Compaction is never permanently disabled.

### Constructor

```python
CompactManager(
    context_window: int = 128_000,
    buffer_tokens: int = 13_000,
    max_failures: int = 3,
    enabled: bool = True
)
```

### Methods

- `should_compact(messages: list[dict]) → bool`: Check if the messages exceed the token threshold.
- `compact(messages: list[dict], llm) → list[dict]`: Summarize the messages and return a shorter replacement list.

### Properties

- `threshold: int`: Context window threshold in tokens.
- `compact_count: int`: Number of compaction passes applied.
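With the defaults above, compaction plausibly triggers once the estimated token count exceeds `context_window - buffer_tokens`. A sketch of that check; the subtraction is an assumption about how `threshold` is derived, and the real class estimates tokens from the message list rather than taking a number directly:

```python
def compact_threshold(context_window: int = 128_000, buffer_tokens: int = 13_000) -> int:
    # Assumed relation: leave buffer_tokens of headroom below the context window.
    return context_window - buffer_tokens

def should_compact(estimated_tokens: int) -> bool:
    # Mirrors the should_compact contract: compare an estimate to the threshold.
    return estimated_tokens > compact_threshold()

print(compact_threshold())      # 115000
print(should_compact(120_000))  # True
```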
## ContextAssembler

*class*: Composes LLM messages from config templates, skill docs, and run data.

### Constructor

```python
ContextAssembler(config: dict[str, Any], skill_pool: SkillPool)
```

### Methods

- `build_trace_messages(run_dir, skill, task_ctx=None, ...) → list[dict]`: Build `[system, user]` messages for the trace agent using layered composition.
- `build_discovery_messages(run_dir, listing, samples) → list[dict]`: Build messages for the skill generator agent.
## Discovery Explorer

Three-phase recursive trajectory discovery with LLM-guided fallback for arbitrarily nested directories.

### discover_trajectory_dirs

*function*

```python
discover_trajectory_dirs(
    root: Path,
    config: dict[str, Any] | None = None,
    llm: LLMClient | None = None
) → list[Path]
```

Discovers trajectory directories under `root` using a three-phase strategy:

1. **Marker scan** — recursive walk for known markers (`results.json`, `steps.json`, etc.)
2. **Skill detection** — validate candidates via `SkillPool`
3. **LLM analysis** — when the fast scan returns empty, use the LLM to analyze the directory structure
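The marker-scan phase amounts to a recursive walk collecting directories that directly contain a known marker file. A standalone sketch of just that phase (`marker_scan` is a hypothetical helper, marker names are taken from the list above, and the skill-detection and LLM phases are omitted):

```python
from pathlib import Path
import tempfile

MARKERS = {"results.json", "steps.json"}  # subset of the known markers

def marker_scan(root: Path) -> list[Path]:
    # Collect every directory under root that directly contains a marker file.
    hits = {p.parent for p in root.rglob("*") if p.name in MARKERS}
    return sorted(hits)

# Tiny demonstration on a throwaway tree.
root = Path(tempfile.mkdtemp())
(root / "run_a").mkdir()
(root / "run_a" / "steps.json").write_text("{}")
(root / "nested" / "run_b").mkdir(parents=True)
(root / "nested" / "run_b" / "results.json").write_text("{}")
found = marker_scan(root)
print(sorted(p.name for p in found))  # ['run_a', 'run_b']
```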
### detect_or_generate_skill

*function*

```python
detect_or_generate_skill(
    run_dir: Path,
    normalizer: Normalizer,
    pool: SkillPool,
    llm: LLMClient,
    config: dict[str, Any],
    user_skill_dir: Path | None = None,
    format_override: str | None = None
) → tuple[Skill | None, NormalizedTrajectory]
```

Unified detect-or-generate entry point. Tries, in order: pre-normalized → step-JSONL → skill detection → auto-generation.
## SkillPool

*class*: Registry of all available trajectory parsers (seed and user-generated skills).

### Constructor

```python
SkillPool(seed_dir: Path = <built-in>, user_dir: Path | None = None)
```

### Methods

- `detect(run_dir: Path) → str | None`: Return the name of the matching skill via two-pass detection.
- `get(name: str) → Skill | None`: Get a skill by name.
- `register(skill: Skill) → None`: Register a new skill.
- `list_skills() → list[Skill]`: Get all registered skills.
- `skill_index() → str`: Compact markdown index for LLM context injection.
## SkillGenerator

*class*: Uses an LLM to analyze unknown trajectory formats and auto-generate parsers.

### Constructor

```python
SkillGenerator(llm: LLMClient, pool: SkillPool, config: dict[str, Any])
```

### Methods

- `generate(run_dir: Path, user_dir: Path) → Skill`: Analyze `run_dir`, generate `SKILL.md` + `parser.py`, then register and return the skill. Raises `RuntimeError` after `max_attempts`.
## Normalizer

*class*: Orchestrates format detection and parsing into a `NormalizedTrajectory`.

### Constructor

```python
Normalizer(pool: SkillPool)
```

### Methods

- `is_pre_normalized(run_dir: Path) → bool`: True if `run_dir` contains `steps.json`.
- `is_step_jsonl_dir(run_dir: Path) → bool`: True if `run_dir` contains `step_N.jsonl` files.
- `detect(run_dir: Path, format_override: str | None = None) → Skill`: Return the matching skill or raise `ValueError`.
- `normalize_pre_normalized(run_dir, output_dir=None, quiet=False) → NormalizedTrajectory`: Load a pre-normalized directory.
- `normalize_step_jsonl(run_dir, output_dir=None, quiet=False) → NormalizedTrajectory`: Load `step_N.jsonl` annotation files.
- `normalize(run_dir, skill, output_dir=None, quiet=False) → NormalizedTrajectory`: Parse using the skill and write derived artifacts.
## TreeBuilder

*class*: Converts normalized trajectories into `tree.md` navigation indices with step classification.

### Constructor

```python
TreeBuilder(llm=None, config: dict[str, Any] | None = None)
```

### Methods

- `build(traj: NormalizedTrajectory) → str`: Build the tree from step classification (fast, no LLM).
- `build_with_llm(traj: NormalizedTrajectory) → str`: Build the tree using an LLM for richer classification labels.
- `build_from_annotation(traj, annotation, run_dir=None) → str`: Build the tree from per-step annotation labels.
## Memory Service

Cross-trajectory memory with online mid-analysis extraction. Accumulates agent-specific failure patterns and investigation strategies.

### OnlineMemoryExtractor

*class*: Fire-and-forget background extraction during analysis.

```python
OnlineMemoryExtractor(
    agent_type: str,
    llm: Any,
    memory_dir: Path | None = None,
    step_interval: int = 8,
    token_threshold: int = 30_000
)
```

- `should_extract(step: int, total_tokens: int) → bool`: Check whether extraction should trigger.
- `extract_async(messages, step, total_tokens) → None`: Launch extraction in a background thread.
### Module Functions

- `load_memory(agent_type: str, memory_dir: Path | None = None) → str`: Load the `TRACER.md` memory file; returns its contents or an empty string.
- `update_memory(agent_type, analysis_summary, failure_patterns=None, memory_dir=None) → Path`: Append insights to `TRACER.md` with a timestamp.
- `auto_extract_memory(agent_type, labels_path, analysis_summary="", memory_dir=None) → Path | None`: One-shot post-analysis memory extraction.
- `extract_failure_patterns(labels: list[dict]) → list[str]`: Extract short failure-pattern strings from labels.
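A sketch of the kind of filtering `extract_failure_patterns` performs, assuming labels are per-step dicts using the `StepLabel` field names; the exact output string format is an assumption:

```python
def extract_failure_patterns(labels: list[dict]) -> list[str]:
    # Keep only steps judged incorrect, condensing each into a short pattern string.
    patterns = []
    for label in labels:
        if label.get("verdict") == "incorrect":
            patterns.append(f"step {label['step_id']}: {label.get('deviation_type', 'unknown')}")
    return patterns

labels = [
    {"step_id": 1, "verdict": "correct"},
    {"step_id": 2, "verdict": "incorrect", "deviation_type": "wrong_file_edited"},
    {"step_id": 5, "verdict": "unuseful"},
]
print(extract_failure_patterns(labels))  # ['step 2: wrong_file_edited']
```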
## CostTracker

*dataclass*: Tracks LLM cost across pipeline phases and enforces budget limits.

### Methods

- `add_usage(model, input_tokens, output_tokens, phase="trace", duration_s=0.0) → float`: Record usage and return the incremental USD cost.
- `is_over_budget() → bool`: Check if total cost ≥ the budget limit.
- `should_warn() → bool`: Check if the warning threshold (80% of budget) has been reached.
- `get_phase_costs() → dict[str, PhaseCost]`: Get a per-phase cost breakdown.
- `format_summary() → str`: Human-readable cost summary.

### Properties

- `total_cost: float`
- `budget_remaining: float`
- `budget_used_pct: float`
## ModelCosts

*dataclass*: Per-million-token pricing.

| Field | Type | Default |
|---|---|---|
| input_per_mtok | float | 3.0 |
| output_per_mtok | float | 15.0 |
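Per-million-token pricing makes incremental cost a two-term product. A sketch of the arithmetic `add_usage` and `should_warn` presumably perform, using the default rates above and the CLI's default $3.00 budget; both helpers here are illustrative stand-ins, not the real API:

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_per_mtok: float = 3.0, output_per_mtok: float = 15.0) -> float:
    # USD cost of one call at per-million-token rates.
    return (input_tokens / 1_000_000) * input_per_mtok + \
           (output_tokens / 1_000_000) * output_per_mtok

def should_warn(total_cost: float, budget: float = 3.0) -> bool:
    # Documented warning threshold: 80% of the budget limit.
    return total_cost >= 0.8 * budget

print(call_cost(1_000_000, 100_000))  # 4.5  (3.0 for input + 1.5 for output)
```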
## LLMClient

*class*: OpenAI-compatible LLM client with categorized retry logic and Azure AD support.

### Constructor

```python
LLMClient(
    api_base: str = "",
    api_key: str = "",
    model_name: str | None = None,
    model_kwargs: dict = {},
    azure_ad_resource: str = ""
)
```

### Methods

- `query(messages: list[dict], **kwargs) → dict[str, Any]`: Query the LLM with retry. Returns `{"content": str, "usage": {...}}`.

### Properties

- `model_name: str | None`
- `cost: float`: Accumulated cost (deprecated; use `CostTracker`).
- `n_calls: int`
- `total_prompt_tokens: int`
- `total_completion_tokens: int`
## Data Models

### NormalizedTrajectory

*dataclass*: Fully normalized trajectory ready for tree building and tracing.

| Field | Type | Description |
|---|---|---|
| steps | list[StepRecord] | List of steps |
| task_description | str | Task description |
| metadata | dict | Format/run metadata |

- `write_steps_json(path: Path) → None`
- `step_count: int` (property)
### StepRecord

*dataclass*: One normalized step: an action and its observation.

| Field | Type | Description |
|---|---|---|
| step_id | int | Step index |
| action | str | Action taken |
| observation | str \| None | Observation result |
| thinking | str \| None | Internal reasoning |
| tool_type | str \| None | Type of tool used |
| action_ref | FileRef \| None | Source location reference |
| observation_ref | FileRef \| None | Source location reference |
### ErrorAnalysis

*dataclass*: Result of trajectory error analysis.

| Field | Type | Description |
|---|---|---|
| traj_id | str | Trajectory identifier |
| labels | list[StepLabel] | Per-step labels |
| summary | str | Analysis summary |
| metadata | dict | Additional metadata |

- `save(path: Path) → None`
- `load(path: Path) → ErrorAnalysis` (classmethod)
- `from_labels_json(path: Path, traj_id: str) → ErrorAnalysis` (classmethod)
- `first_incorrect_step_id: int | None` (property)
### StepLabel

*dataclass*: Label for a single diagnosed step.

| Field | Type | Description |
|---|---|---|
| step_id | int | Target step |
| verdict | StepVerdict | One of INCORRECT / UNUSEFUL / CORRECT |
| reasoning | str | Why this verdict |
| deviation_type | str | Type of deviation |
| correct_alternative | str | What should have happened |

### StepVerdict

*enum*: `INCORRECT = "incorrect"`, `UNUSEFUL = "unuseful"`, `CORRECT = "correct"`
### ReplayResult

*dataclass*: Outcome of a replay session.

| Field | Type | Description |
|---|---|---|
| status | ReplayStatus | SUCCESS / PARTIAL / FAILED |
| checkpoint | StepCheckpoint \| None | Final checkpoint |
| steps_replayed | int | Count of steps replayed |
| agent_output | str | Agent's response |
### TaskContext

*dataclass*: Task metadata and provider reference.

| Field | Type | Description |
|---|---|---|
| bench_type | str | Benchmark type |
| task_name | str | Task name |
| task_dir | Path | Task directory |
| problem_statement | str \| None | Problem description |

- `load(task_dir: Path, pool=None) → TaskContext` (classmethod): Auto-detect and create the context.
- `prepare_sandbox(target_parent: Path) → Path`
- `exploration_instructions(sandbox: Path) → str`
## Output Profiles

### OutputProfile

*dataclass*

| Field | Type | Description |
|---|---|---|
| name | str | Profile name |
| schema_ref | str | Schema reference |
| finalize_instruction | str | Agent output format instructions |
| output_file | str | Output filename |

### Built-in Profiles

- `tracebench` → `codetracer_labels.json`: Stage-level labels with `incorrect_step_ids` and `unuseful_step_ids` for benchmark evaluation.
- `detailed` → `codetracer_analysis.json`: Root-cause chains, critical decision points, and comprehensive analysis.
- `rl_feedback` → `codetracer_rl_feedback.json`: Per-step deviation analysis and reward signals for RL training.

### Functions

- `load_profile(name: str, config=None) → OutputProfile`: Load a profile by name; raises `ValueError` if unknown.
- `get_default_profile_name(config=None) → str`: Returns the default profile name (`"detailed"`).
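The lookup contract of `load_profile` can be sketched with a plain registry; the profile names and output files come from the built-in table above, while `_PROFILES` and the reduced dataclass are illustrative stand-ins:

```python
from dataclasses import dataclass

@dataclass
class OutputProfile:
    name: str
    output_file: str

# Illustrative registry mirroring the documented built-in profiles.
_PROFILES = {
    "tracebench": OutputProfile("tracebench", "codetracer_labels.json"),
    "detailed": OutputProfile("detailed", "codetracer_analysis.json"),
    "rl_feedback": OutputProfile("rl_feedback", "codetracer_rl_feedback.json"),
}

def load_profile(name: str) -> OutputProfile:
    # Mirror the documented contract: raise ValueError for unknown names.
    if name not in _PROFILES:
        raise ValueError(f"unknown output profile: {name}")
    return _PROFILES[name]

print(load_profile("detailed").output_file)  # codetracer_analysis.json
```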
## Plugin Adapters

Integration surface for embedding CodeTracer into external agent frameworks.

### PluginAdapter

*abstract*: Base class for framework integration.

- `name() → str`: Unique identifier (e.g. `'openhands'`).
- `ingest_trajectory(raw_path, **kwargs) → NormalizedTrajectory`
- `analyze(traj, **kwargs) → ErrorAnalysis`
- `replay(traj, step_id, analysis, **kwargs) → ReplayResult`
- `analyze_and_replay(raw_path, **kwargs) → ReplayResult`: Convenience: ingest → analyze → auto-replay.
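`analyze_and_replay` chains the three abstract steps. A reduced sketch of that composition, with dicts and strings standing in for the real data models and a toy subclass to exercise the pipeline:

```python
from abc import ABC, abstractmethod

class PluginAdapter(ABC):
    # Reduced stand-ins: real signatures use NormalizedTrajectory,
    # ErrorAnalysis, and ReplayResult.
    @abstractmethod
    def ingest_trajectory(self, raw_path: str): ...
    @abstractmethod
    def analyze(self, traj): ...
    @abstractmethod
    def replay(self, traj, step_id, analysis): ...

    def analyze_and_replay(self, raw_path: str):
        # Convenience: ingest → analyze → replay from the first incorrect step.
        traj = self.ingest_trajectory(raw_path)
        analysis = self.analyze(traj)
        return self.replay(traj, analysis["first_incorrect"], analysis)

class EchoAdapter(PluginAdapter):
    # Toy adapter used only to exercise the composed pipeline.
    def ingest_trajectory(self, raw_path): return {"path": raw_path}
    def analyze(self, traj): return {"first_incorrect": 3}
    def replay(self, traj, step_id, analysis): return f"replayed from step {step_id}"

print(EchoAdapter().analyze_and_replay("run_001"))  # replayed from step 3
```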
### GenericPluginAdapter

*class*: Full pipeline adapter backed by CodeTracer components.

```python
GenericPluginAdapter(
    skill_name: str,
    *,
    bench_name: str | None = None,
    config: dict | None = None,
    llm_kwargs: dict | None = None
)
```
### Built-in Adapters

| Adapter | Skill Name | Framework |
|---|---|---|
| MinisweAdapter | miniswe | MiniSWE Agent |
| OpenHandsAdapter | openhands | OpenHands |
| SweAgentAdapter | swe_agent | SWE-Agent |
## CLI Reference

```text
Usage: codetracer [COMMAND] [OPTIONS]

Commands:
  analyze      Run trajectory diagnosis on a run directory
  run          Full pipeline: detect → normalize → tree → analyze → replay
  replay       Resume a trajectory from the diagnosed breakpoint
  inspect      Inspect a specific step or range
  interactive  Enter an interactive REPL with menu-driven actions
  normalize    Normalize a trajectory to steps.json format
  tree         Build a step classification tree
  batch        Run batch analysis from a manifest

Global Options:
  --model TEXT      LLM model name
  --api-base URL    API endpoint
  --api-key TEXT    API key
  --config PATH     Custom configuration file
  --profile TEXT    Output profile (tracebench / detailed / rl_feedback)
  --cost-limit $    Max LLM spend per trajectory (default: 3.0)
  --dry-run         Normalize + tree only, skip LLM analysis
```
CodeTracer © 2026 Nanjing University • MIT License