
Running Evaluations

Terminal window
agentv eval evals/my-eval.yaml

Results are written to .agentv/results/<timestamp>.jsonl. Each line is a JSON object with one result per test case.

Each scores[] entry includes per-grader timing:

{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-grader",
      "score": 0.9,
      "verdict": "pass",
      "assertions": [
        { "text": "clear structure", "passed": true }
      ],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}

The duration_ms, started_at, and ended_at fields are present on every grader result (including code-grader), enabling per-grader bottleneck analysis.
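As a sketch of that bottleneck analysis (assuming only the scores[] shape shown above; unrelated fields omitted), the slowest grader in a result line can be picked out like this:

```typescript
// Find the slowest grader in one results-JSONL line, using the
// per-grader duration_ms field documented above.
interface GraderScore {
  name: string;
  score: number;
  duration_ms: number;
}

function slowestGrader(resultLine: string): GraderScore {
  const result = JSON.parse(resultLine) as { scores: GraderScore[] };
  return result.scores.reduce((max, s) =>
    s.duration_ms > max.duration_ms ? s : max
  );
}

// Illustrative line (second grader name is invented for the example).
const line = JSON.stringify({
  scores: [
    { name: "format_structure", score: 0.9, duration_ms: 9103 },
    { name: "code_check", score: 1.0, duration_ms: 412 },
  ],
});
console.log(slowestGrader(line).name); // format_structure
```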

Run against a different target than specified in the eval file:

Terminal window
agentv eval --target azure-base evals/**/*.yaml

Tag a pipeline run with an experiment name to track different conditions (e.g. with vs without skills):

Terminal window
agentv pipeline run evals/my-eval.yaml --experiment with_skills
agentv pipeline run evals/my-eval.yaml --experiment without_skills

The experiment label is written to manifest.json and propagated to each entry in index.jsonl by pipeline bench. The eval file stays the same across experiments — what changes is the environment. Dashboards can filter and compare results by experiment.
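A dashboard-style comparison can be sketched as follows (the `experiment` and `score` field names are illustrative, not the exact index.jsonl schema):

```typescript
// Group index.jsonl entries by experiment label and compute the
// mean score per experiment, e.g. with_skills vs without_skills.
interface IndexEntry {
  experiment: string;
  score: number;
}

function meanScoreByExperiment(entries: IndexEntry[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const e of entries) {
    const agg = sums.get(e.experiment) ?? { total: 0, count: 0 };
    agg.total += e.score;
    agg.count += 1;
    sums.set(e.experiment, agg);
  }
  return new Map(
    [...sums].map(([name, { total, count }]) => [name, total / count])
  );
}

const entries: IndexEntry[] = [
  { experiment: "with_skills", score: 0.9 },
  { experiment: "with_skills", score: 0.7 },
  { experiment: "without_skills", score: 0.6 },
];
console.log(meanScoreByExperiment(entries));
```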

Run a single test by ID:

Terminal window
agentv eval --test-id case-123 evals/my-eval.yaml

Test the harness flow with mock responses (does not call real providers):

Terminal window
agentv eval --dry-run evals/my-eval.yaml

Write results to a custom output path instead of the default timestamped file:

Terminal window
agentv eval evals/my-eval.yaml --out results/baseline.jsonl

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:

By default, AgentV writes a per-run workspace with index.jsonl as the canonical manifest for result-oriented workflows. For full-fidelity span inspection, export OTLP JSON explicitly.

Terminal window
# Summary-level inspection from the run manifest
agentv trace stats .agentv/results/runs/<timestamp>/index.jsonl
# Full-fidelity OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json
# Inspect the OTLP trace export
agentv trace show traces/eval.otlp.json --tree

index.jsonl contains aggregate metrics such as score, latency, cost, token usage, and summary trace counters. --otel-file writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.

Stream traces directly to an observability backend during evaluation using --export-otel:

Terminal window
# Use a backend preset (braintrust, langfuse, confident)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust
# Include message content and tool I/O in spans (disabled by default for privacy)
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content
# Group messages into turn spans for multi-turn evaluations
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-group-turns

Set up your environment:

Terminal window
export BRAINTRUST_API_KEY=sk-...
export BRAINTRUST_PROJECT=my-project # associates traces with a Braintrust project

Run an eval with traces sent to Braintrust:

Terminal window
agentv eval evals/my-eval.yaml --export-otel --otel-backend braintrust --otel-capture-content

The following environment variables control project association (at least one is required):

| Variable | Format | Example |
| --- | --- | --- |
| BRAINTRUST_PROJECT | Project name | my-evals |
| BRAINTRUST_PROJECT_ID | Project UUID | proj_abc123 |
| BRAINTRUST_PARENT | Raw x-bt-parent header | project_name:my-evals |

Each eval test case produces a trace with:

  • Root span (agentv.eval) — test ID, target, score, duration
  • LLM call spans (chat <model>) — model name, token usage (input/output/cached)
  • Tool call spans (execute_tool <name>) — tool name, arguments, results (with --otel-capture-content)
  • Turn spans (agentv.turn.N) — groups messages by conversation turn (with --otel-group-turns)
  • Evaluator events — per-grader scores attached to the root span

For Langfuse, set your API keys and run with the langfuse backend preset:

Terminal window
export LANGFUSE_PUBLIC_KEY=pk-...
export LANGFUSE_SECRET_KEY=sk-...
# Optional: export LANGFUSE_HOST=https://cloud.langfuse.com
agentv eval evals/my-eval.yaml --export-otel --otel-backend langfuse --otel-capture-content

For backends not covered by presets, configure via environment variables:

Terminal window
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend/v1/traces
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token"
agentv eval evals/my-eval.yaml --export-otel

Use workspace mode and finish policies instead of multiple conflicting booleans:

Terminal window
# Mode: pooled | temp | static
agentv eval evals/my-eval.yaml --workspace-mode pooled
# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace
# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full
# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep

Equivalent eval YAML:

workspace:
  mode: pooled # pooled | temp | static
  path: null # workspace path for mode=static; auto-materialised when empty/missing
  hooks:
    enabled: true # set false to skip all hooks
  after_each:
    reset: fast # none | fast | strict

Notes:

  • Pooled mode is the default for shared workspaces with repos when mode is not specified.
  • mode: static (or --workspace-mode static) uses path / --workspace-path. When the path is empty or missing, the workspace is auto-materialised (template copied + repos cloned). Populated directories are reused as-is.
  • Static mode is incompatible with isolation: per_test.
  • hooks.enabled: false skips all lifecycle hooks (setup, teardown, reset).
  • Pool slots are managed separately (agentv workspace list|clean).

Re-run only the tests that had infrastructure/execution errors from a previous output:

Terminal window
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl

This reads the previous JSONL, filters for executionStatus === 'execution_error', and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
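The merge described above amounts to something like the following sketch (result shape simplified to the fields involved):

```typescript
// Re-run only previous execution errors; keep every other result
// from the prior run and splice in the fresh re-run results.
interface EvalResult {
  testId: string;
  executionStatus: string;
  score?: number;
}

function mergeRetry(previous: EvalResult[], rerun: EvalResult[]): EvalResult[] {
  const rerunById = new Map(rerun.map((r) => [r.testId, r]));
  return previous.map((p) =>
    p.executionStatus === "execution_error"
      ? rerunById.get(p.testId) ?? p // fall back to the old entry if not re-run
      : p
  );
}

const previous: EvalResult[] = [
  { testId: "a", executionStatus: "completed", score: 0.9 },
  { testId: "b", executionStatus: "execution_error" },
];
const rerun: EvalResult[] = [
  { testId: "b", executionStatus: "completed", score: 0.8 },
];
console.log(mergeRetry(previous, rerun));
```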

Control whether the eval run halts on execution errors using execution.fail_on_error in the eval YAML:

execution:
  fail_on_error: false # never halt on errors (default)
  # fail_on_error: true # halt on first execution error

| Value | Behavior |
| --- | --- |
| true | Halt immediately on first execution error |
| false | Continue despite errors (default) |

When halted, remaining tests are recorded with failureReasonCode: 'error_threshold_exceeded'. With concurrency > 1, a few additional tests may complete before halting takes effect.

Set a per-test score threshold for the eval suite. Each test case must score at or above this value to pass. If any test scores below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates.

CLI flag:

Terminal window
agentv eval evals/ --threshold 0.8

YAML config:

execution:
  threshold: 0.8

The CLI --threshold flag overrides the YAML value. The threshold is a number between 0 and 1 (default: 0.8). Execution errors are excluded from the count.

When active, the summary line shows how many tests met the threshold:

RESULT: PASS (28/31 scored >= 0.8, mean: 0.927)

The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as <failure> in JUnit output. When no threshold is set, JUnit defaults to 0.5.
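The gate logic can be sketched like this (field names simplified; execution errors excluded from the count, as described above):

```typescript
// Per-test threshold gate: every scored test must reach the
// threshold; execution errors are left out of the tally.
interface ScoredResult {
  executionStatus: string;
  score: number;
}

function thresholdGate(results: ScoredResult[], threshold = 0.8) {
  const scored = results.filter((r) => r.executionStatus !== "execution_error");
  const passed = scored.filter((r) => r.score >= threshold);
  const mean = scored.reduce((sum, r) => sum + r.score, 0) / scored.length;
  return {
    pass: passed.length === scored.length, // exit code 1 when false
    passed: passed.length,
    total: scored.length,
    mean,
  };
}

const results: ScoredResult[] = [
  { executionStatus: "completed", score: 0.9 },
  { executionStatus: "completed", score: 0.95 },
  { executionStatus: "execution_error", score: 0 }, // excluded
];
console.log(thresholdGate(results)); // pass: true (2/2 at or above 0.8)
```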

Check eval files for schema errors without executing:

Terminal window
agentv validate evals/my-eval.yaml

Run a code-grader assertion in isolation without executing a full eval suite:

Terminal window
agentv eval assert <name> --agent-output <text> --agent-input <text>

The command discovers the assertion script by walking up directories looking for .agentv/graders/<name>.{ts,js,mts,mjs}, then passes the input via stdin and prints the result JSON to stdout.

Terminal window
# Run an assertion with inline arguments
agentv eval assert rouge-score \
--agent-output "The fox jumps over the lazy dog" \
--agent-input "Summarise the article"
# Or pass a JSON payload file
agentv eval assert rouge-score --file result.json

The --file option reads a JSON file with { "output": "...", "input": "..." } fields.

Exit codes: 0 if score >= 0.5 (pass), 1 if score < 0.5 (fail).

This is the same interface that agent-orchestrated evals use — the EVAL.yaml transpiler emits assertion instructions for code graders so external grading agents can execute them directly.
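A grader script along these lines might look as follows. This is a hypothetical sketch: the toy scoring rule and the exact payload/result field names are illustrative, and the real script would read its payload from stdin rather than a local constant.

```typescript
// Hypothetical shape of a code grader (.agentv/graders/<name>.ts):
// take { output, input }, return { score, verdict } as JSON.
interface GraderPayload {
  output: string;
  input: string;
}

function grade(payload: GraderPayload): { score: number; verdict: string } {
  // Toy rule for illustration: pass when the output is non-empty
  // and under 500 characters.
  const ok = payload.output.length > 0 && payload.output.length < 500;
  const score = ok ? 1 : 0;
  // Mirror the CLI's pass/fail cut-off at 0.5.
  return { score, verdict: score >= 0.5 ? "pass" : "fail" };
}

const payload: GraderPayload = {
  output: "The fox jumps over the lazy dog",
  input: "Summarise the article",
};
console.log(JSON.stringify(grade(payload))); // {"score":1,"verdict":"pass"}
```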

Grade existing agent sessions without re-running them. Import a transcript, then run deterministic evaluators:

Terminal window
# Import a Claude Code session
agentv import claude --discover latest
# Run evaluators against the imported transcript
agentv eval evals/my-eval.yaml --transcript .agentv/transcripts/claude-<id>.jsonl

See the Import tool docs for all providers and options.

Declare the minimum AgentV version needed by your eval project in .agentv/config.yaml:

required_version: ">=2.12.0"

The value is a semver range using standard npm syntax (e.g., >=2.12.0, ^2.12.0, ~2.12, >=2.12.0 <3.0.0).

| Condition | Interactive (TTY) | Non-interactive (CI) |
| --- | --- | --- |
| Version satisfies range | Runs silently | Runs silently |
| Version below range | Warns + prompts to continue | Warns to stderr, continues |
| --strict flag + mismatch | Warns + exits 1 | Warns + exits 1 |
| No required_version set | Runs silently | Runs silently |
| Malformed semver range | Error + exits 1 | Error + exits 1 |
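To illustrate the simplest case, here is a minimal sketch of a ">=X.Y.Z" check. It is hand-rolled for clarity only: full npm ranges (^, ~, compound ranges) need a real semver library, which is presumably what AgentV uses internally.

```typescript
// Minimal ">=X.Y.Z" comparison: compare major, minor, patch in order.
function satisfiesMin(version: string, range: string): boolean {
  const min = range.replace(/^>=/, "").trim();
  const toParts = (v: string) => v.split(".").map(Number);
  const [a, b] = [toParts(version), toParts(min)];
  for (let i = 0; i < 3; i++) {
    if ((a[i] ?? 0) !== (b[i] ?? 0)) return (a[i] ?? 0) > (b[i] ?? 0);
  }
  return true; // versions are equal
}

console.log(satisfiesMin("2.12.1", ">=2.12.0")); // true
console.log(satisfiesMin("2.11.9", ">=2.12.0")); // false
```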

Use --strict in CI pipelines to enforce version requirements:

Terminal window
agentv eval --strict evals/my-eval.yaml

Set default execution options so you don’t have to pass them on every CLI invocation. Both .agentv/config.yaml and agentv.config.ts are supported.

execution:
  verbose: true
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json

| Field | CLI equivalent | Type | Default | Description |
| --- | --- | --- | --- | --- |
| verbose | --verbose | boolean | false | Enable verbose logging |
| keep_workspaces | --keep-workspaces | boolean | false | Always keep temp workspaces after eval |
| otel_file | --otel-file | string | none | Write OTLP JSON trace to file |

The equivalent agentv.config.ts:

import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});

The {timestamp} placeholder is replaced with an ISO-like timestamp (e.g., 2026-03-05T14-30-00-000Z) at execution time.
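One plausible way that substitution works, consistent with the example above, is replacing the filename-unsafe characters in an ISO timestamp (a sketch, not the exact implementation):

```typescript
// Fill the {timestamp} placeholder with a filename-safe ISO
// timestamp: ':' and '.' become '-'.
function fillTimestamp(template: string, now: Date): string {
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return template.replace("{timestamp}", stamp);
}

const path = fillTimestamp(
  ".agentv/results/otel-{timestamp}.json",
  new Date("2026-03-05T14:30:00.000Z")
);
console.log(path); // .agentv/results/otel-2026-03-05T14-30-00-000Z.json
```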

Precedence: CLI flags > .agentv/config.yaml > agentv.config.ts > built-in defaults.

Override the default ~/.agentv directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

Terminal window
# Linux/macOS
export AGENTV_HOME=/data/agentv
# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"
# Windows (CMD)
set AGENTV_HOME=D:\agentv

When set, AgentV logs Using AGENTV_HOME: <path> on startup to confirm the override is active.

Run agentv eval --help for the full list of options including workers, timeouts, output formats, and trace dumping.