
Running Evaluations

Terminal window
agentv eval evals/my-eval.yaml

Results are written to .agentv/results/<timestamp>.jsonl. Each line is a JSON object with one result per test case.

Each scores[] entry includes per-grader timing:

{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-grader",
      "score": 0.9,
      "verdict": "pass",
      "assertions": [
        { "text": "clear structure", "passed": true }
      ],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}

The duration_ms, started_at, and ended_at fields are present on every grader result (including code-grader), enabling per-grader bottleneck analysis.
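As a sketch of that bottleneck analysis (assuming only the scores[] shape shown above; unrelated fields omitted), the slowest grader in a result line can be picked out like this:

```typescript
// Find the slowest grader in one results-JSONL line, using the
// per-grader duration_ms field documented above.
interface GraderScore {
  name: string;
  score: number;
  duration_ms: number;
}

function slowestGrader(resultLine: string): GraderScore {
  const result = JSON.parse(resultLine) as { scores: GraderScore[] };
  return result.scores.reduce((max, s) =>
    s.duration_ms > max.duration_ms ? s : max
  );
}

// Illustrative line (second grader name is invented for the example).
const line = JSON.stringify({
  scores: [
    { name: "format_structure", score: 0.9, duration_ms: 9103 },
    { name: "code_check", score: 1.0, duration_ms: 412 },
  ],
});
console.log(slowestGrader(line).name); // format_structure
```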

Run against a different target than specified in the eval file:

Terminal window
agentv eval --target azure-base evals/**/*.yaml

Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):

Terminal window
agentv pipeline run evals/my-eval.yaml --experiment with_skills
agentv pipeline run evals/my-eval.yaml --experiment without_skills

The experiment label is written to manifest.json and propagated to each entry in index.jsonl by pipeline bench. The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
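A dashboard-style comparison can be sketched as follows (the `experiment` and `score` field names are illustrative, not the exact index.jsonl schema):

```typescript
// Group index.jsonl entries by experiment label and compute the
// mean score per experiment, e.g. with_skills vs without_skills.
interface IndexEntry {
  experiment: string;
  score: number;
}

function meanScoreByExperiment(entries: IndexEntry[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const e of entries) {
    const agg = sums.get(e.experiment) ?? { total: 0, count: 0 };
    agg.total += e.score;
    agg.count += 1;
    sums.set(e.experiment, agg);
  }
  return new Map(
    [...sums].map(([name, { total, count }]) => [name, total / count])
  );
}

const entries: IndexEntry[] = [
  { experiment: "with_skills", score: 0.9 },
  { experiment: "with_skills", score: 0.7 },
  { experiment: "without_skills", score: 0.6 },
];
console.log(meanScoreByExperiment(entries));
```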

Run a single test by ID:

Terminal window
agentv eval --test-id case-123 evals/my-eval.yaml

Test the harness flow with mock responses (does not call real providers):

Terminal window
agentv eval --dry-run evals/my-eval.yaml

Write results to a custom output path instead of the default timestamped file:

Terminal window
agentv eval evals/my-eval.yaml --out results/baseline.jsonl

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:

By default, AgentV writes a per-run workspace with index.jsonl as the canonical manifest for result-oriented workflows. For full-fidelity span inspection, export OTLP JSON explicitly.

Terminal window
# Summary-level inspection from the run manifest
agentv trace stats .agentv/results/runs/<timestamp>/index.jsonl
# Full-fidelity OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json
# Inspect the OTLP trace export
agentv trace show traces/eval.otlp.json --tree

index.jsonl contains aggregate metrics such as score, latency, cost, token usage, and summary trace counters. --otel-file writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.

Stream traces directly to an observability backend during evaluation using --export-otel:

Terminal window
# Use a backend preset (braintrust, langfuse, confident)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust
# Include message content and tool I/O in spans (disabled by default for privacy)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content
# Group messages into turn spans for multi-turn evaluations
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-group-turns

Set up your environment:

Terminal window
export BRAINTRUST_API_KEY=sk-...
export BRAINTRUST_PROJECT=my-project # associates traces with a Braintrust project

Run an eval with traces sent to Braintrust:

Terminal window
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content

The following environment variables control project association (at least one is required):

| Variable | Format | Example |
| --- | --- | --- |
| BRAINTRUST_PROJECT | Project name | my-evals |
| BRAINTRUST_PROJECT_ID | Project UUID | proj_abc123 |
| BRAINTRUST_PARENT | Raw x-bt-parent header | project_name:my-evals |

Each eval test case produces a trace with:

  • Root span (agentv.eval) — test ID, target, score, duration
  • LLM call spans (chat <model>) — model name, token usage (input/output/cached)
  • Tool call spans (execute_tool <name>) — tool name, arguments, results (with --otel-capture-content)
  • Turn spans (agentv.turn.N) — groups messages by conversation turn (with --otel-group-turns)
  • Evaluator events — per-grader scores attached to the root span

For Langfuse, set your API keys and run with the langfuse backend preset:

Terminal window
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# Optional: export LANGFUSE_HOST=https://cloud.langfuse.com
agentv eval evals/my-eval.yaml --export-otel --otel-backend langfuse --otel-capture-content

For backends not covered by presets, configure via environment variables:

Terminal window
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/traces
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
agentv eval evals/my-eval.yaml --export-otel

Use workspace mode and finish policies instead of multiple conflicting booleans:

Terminal window
# Mode: pooled | temp | static
agentv eval evals/my-eval.yaml --workspace-mode pooled
# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace
# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full
# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep

Equivalent eval YAML:

workspace:
  mode: pooled # pooled | temp | static
  path: null # workspace path for mode=static; auto-materialised when empty/missing
  hooks:
    enabled: true # set false to skip all hooks
  after_each:
    reset: fast # none | fast | strict

Notes:

  • Pooled mode is the default for shared workspaces with repos when mode is not specified.
  • mode: static (or --workspace-mode static) uses path / --workspace-path. When the path is empty or missing, the workspace is auto-materialised (template copied + repos cloned). Populated directories are reused as-is.
  • Static mode is incompatible with isolation: per_test.
  • hooks.enabled: false skips all lifecycle hooks (setup, teardown, reset).
  • Pool slots are managed separately (agentv workspace list|clean).

Re-run only the tests that had infrastructure/execution errors from a previous output:

Terminal window
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl

This reads the previous JSONL, filters for executionStatus === 'execution_error', and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
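The merge described above amounts to something like the following sketch (result shape simplified to the fields involved):

```typescript
// Re-run only previous execution errors; keep every other result
// from the prior run and splice in the fresh re-run results.
interface EvalResult {
  testId: string;
  executionStatus: string;
  score?: number;
}

function mergeRetry(previous: EvalResult[], rerun: EvalResult[]): EvalResult[] {
  const rerunById = new Map(rerun.map((r) => [r.testId, r]));
  return previous.map((p) =>
    p.executionStatus === "execution_error"
      ? rerunById.get(p.testId) ?? p // fall back to the old entry if not re-run
      : p
  );
}

const previous: EvalResult[] = [
  { testId: "a", executionStatus: "completed", score: 0.9 },
  { testId: "b", executionStatus: "execution_error" },
];
const rerun: EvalResult[] = [
  { testId: "b", executionStatus: "completed", score: 0.8 },
];
console.log(mergeRetry(previous, rerun));
```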

Control whether the eval run halts on execution errors using execution.fail_on_error in the eval YAML:

execution:
  fail_on_error: false # never halt on errors (default)
  # fail_on_error: true # halt on first execution error

| Value | Behavior |
| --- | --- |
| true | Halt immediately on first execution error |
| false | Continue despite errors (default) |

When halted, remaining tests are recorded with failureReasonCode: 'error_threshold_exceeded'. With concurrency > 1, a few additional tests may complete before halting takes effect.

Set a per-test score threshold for the eval suite. Each test case must score at or above this value to pass. If any test scores below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates.

CLI flag:

Terminal window
agentv eval evals/ --threshold 0.8

YAML config:

execution:
  threshold: 0.8

The CLI --threshold flag overrides the YAML value. The threshold is a number between 0 and 1 (default: 0.8). Execution errors are excluded from the count.

When active, the summary line shows how many tests met the threshold:

RESULT: PASS (28/31 scored >= 0.8, mean: 0.927)

The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as <failure> in JUnit output. When no threshold is set, JUnit defaults to 0.5.
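The gate logic can be sketched like this (field names simplified; execution errors excluded from the count, as described above):

```typescript
// Per-test threshold gate: every scored test must reach the
// threshold; execution errors are left out of the tally.
interface ScoredResult {
  executionStatus: string;
  score: number;
}

function thresholdGate(results: ScoredResult[], threshold = 0.8) {
  const scored = results.filter((r) => r.executionStatus !== "execution_error");
  const passed = scored.filter((r) => r.score >= threshold);
  const mean = scored.reduce((sum, r) => sum + r.score, 0) / scored.length;
  return {
    pass: passed.length === scored.length, // exit code 1 when false
    passed: passed.length,
    total: scored.length,
    mean,
  };
}

const results: ScoredResult[] = [
  { executionStatus: "completed", score: 0.9 },
  { executionStatus: "completed", score: 0.95 },
  { executionStatus: "execution_error", score: 0 }, // excluded
];
console.log(thresholdGate(results)); // pass: true (2/2 at or above 0.8)
```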

Check eval files for schema errors without executing:

Terminal window
agentv validate evals/my-eval.yaml

Run a code-grader assertion in isolation without executing a full eval suite:

Terminal window
agentv eval assert <name> --agent-output <text> --agent-input <text>

The command discovers the assertion script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}, then passes the input via stdin and prints the result JSON to stdout.

Terminal window
# Run an assertion with inline arguments
agentv eval assert rouge-score \
--agent-output "The fox jumps over the lazy dog" \
--agent-input "Summarise the article"
# Or pass a JSON payload file
agentv eval assert rouge-score --file result.json

The --file option reads a JSON file with { "output": "...", "input": "..." } fields.

Exit codes: 0 if score >= 0.5 (pass), 1 if score < 0.5 (fail).

This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits assertion instructions for code graders so external grading agents can execute them directly.
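A grader script along these lines might look as follows. This is a hypothetical sketch: the toy scoring rule and the exact payload/result field names are illustrative, and the real script would read its payload from stdin rather than a local constant.

```typescript
// Hypothetical shape of a code grader (.agentv/graders/<name>.ts):
// take { output, input }, return { score, verdict } as JSON.
interface GraderPayload {
  output: string;
  input: string;
}

function grade(payload: GraderPayload): { score: number; verdict: string } {
  // Toy rule for illustration: pass when the output is non-empty
  // and under 500 characters.
  const ok = payload.output.length > 0 && payload.output.length < 500;
  const score = ok ? 1 : 0;
  // Mirror the CLI's pass/fail cut-off at 0.5.
  return { score, verdict: score >= 0.5 ? "pass" : "fail" };
}

const payload: GraderPayload = {
  output: "The fox jumps over the lazy dog",
  input: "Summarise the article",
};
console.log(JSON.stringify(grade(payload))); // {"score":1,"verdict":"pass"}
```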

Grade existing agent sessions without re-running them. Import a transcript, then run deterministic evaluators:

Terminal window
# Import a Claude Code session
agentv import claude --discover latest
# Run evaluators against the imported transcript
agentv eval evals/my-eval.yaml --transcript .agentv/transcripts/claude-<id>.jsonl

See the Import tool docs for all providers and options.

Declare the minimum AgentV version needed by your eval project in .agentv/config.yaml:

required_version: ">=2.12.0"

The value is a semver range using standard npm syntax (e.g., >=2.12.0, ^2.12.0, ~2.12, >=2.12.0 <3.0.0).

| Condition | Interactive (TTY) | Non-interactive (CI) |
| --- | --- | --- |
| Version satisfies range | Runs silently | Runs silently |
| Version below range | Warns + prompts to continue | Warns to stderr, continues |
| --strict flag + mismatch | Warns + exits 1 | Warns + exits 1 |
| No required_version set | Runs silently | Runs silently |
| Malformed semver range | Error + exits 1 | Error + exits 1 |
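To illustrate the simplest case, here is a minimal sketch of a ">=X.Y.Z" check. It is hand-rolled for clarity only: full npm ranges (^, ~, compound ranges) need a real semver library, which is presumably what AgentV uses internally.

```typescript
// Minimal ">=X.Y.Z" comparison: compare major, minor, patch in order.
function satisfiesMin(version: string, range: string): boolean {
  const min = range.replace(/^>=/, "").trim();
  const toParts = (v: string) => v.split(".").map(Number);
  const [a, b] = [toParts(version), toParts(min)];
  for (let i = 0; i < 3; i++) {
    if ((a[i] ?? 0) !== (b[i] ?? 0)) return (a[i] ?? 0) > (b[i] ?? 0);
  }
  return true; // versions are equal
}

console.log(satisfiesMin("2.12.1", ">=2.12.0")); // true
console.log(satisfiesMin("2.11.9", ">=2.12.0")); // false
```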

Use --strict in CI pipelines to enforce version requirements:

Terminal window
agentv eval --strict evals/my-eval.yaml

Set default execution options so you don’t have to pass them on every CLI invocation. Both .agentv/config.yaml and agentv.config.ts are supported.

execution:
  verbose: true
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json

| Field | CLI equivalent | Type | Default | Description |
| --- | --- | --- | --- | --- |
| verbose | --verbose | boolean | false | Enable verbose logging |
| keep_workspaces | --keep-workspaces | boolean | false | Always keep temp workspaces after eval |
| otel_file | --otel-file | string | none | Write OTLP JSON trace to file |

The equivalent agentv.config.ts:

import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});

The {timestamp} placeholder is replaced with an ISO-like timestamp (e.g., 2026-03-05T14-30-00-000Z) at execution time.
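One plausible way that substitution works, consistent with the example above, is replacing the filename-unsafe characters in an ISO timestamp (a sketch, not the exact implementation):

```typescript
// Fill the {timestamp} placeholder with a filename-safe ISO
// timestamp: ':' and '.' become '-'.
function fillTimestamp(template: string, now: Date): string {
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return template.replace("{timestamp}", stamp);
}

const path = fillTimestamp(
  ".agentv/results/otel-{timestamp}.json",
  new Date("2026-03-05T14:30:00.000Z")
);
console.log(path); // .agentv/results/otel-2026-03-05T14-30-00-000Z.json
```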

Precedence: CLI flags > .agentv/config.yaml > agentv.config.ts > built-in defaults.

Override the default ~/.agentv directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

Terminal window
# Linux/macOS
export AGENTV_HOME=/data/agentv
# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"
# Windows (CMD)
set AGENTV_HOME=D:\agentv

When set, AgentV logs Using AGENTV_HOME: <path> on startup to confirm the override is active.

Run agentv eval --help for the full list of options including workers, timeouts, output formats, and trace dumping.