Evaluations (Evals)

Evaluations (commonly called "evals") are automated tests that validate agent behavior and performance. Agent CI evals provide objective metrics for assessing agent functionality across different dimensions including accuracy, performance, safety, and consistency.

What are Evals?

Evals are automated testing frameworks designed specifically for AI agents and LLM applications. Whether you're running accuracy evals, performance evals, or safety evals, Agent CI provides comprehensive evaluation tools to ensure your agents behave as expected.

Overview

Agent CI evaluations run in two contexts:

CI runs: Execute on pull requests as part of the development workflow
Runtime runs: Analyze live agent interactions in production environments

Evaluation configurations are stored as TOML files in the .agentci/evals/ directory of your repository, keeping test definitions version-controlled alongside your code.

Evaluation Types

Agent CI provides six built-in eval types for comprehensive agent testing:

Accuracy Evals

Accuracy evals test whether agent outputs match expected results using exact matches, pattern matching, or JSON schema validation.

Performance Evals

Performance evals measure response time, latency, and resource usage with configurable thresholds.

Safety Evals

Safety evals validate content filtering and security resistance against prompt injection, harmful content, and other security risks. Include built-in templates for common attack vectors.

Consistency Evals

Consistency evals run identical inputs multiple times to verify low-variance outputs across executions.

LLM Evals

LLM evals use LLM-as-judge methodology with configurable scoring criteria for subjective quality measures.

Eval Configuration

Eval configurations use TOML format and are stored in the .agentci/evals/ directory. Each eval file defines:

Evaluation type and parameters
Target agents and tools to test
Test cases with inputs and expected outputs
Scoring criteria and thresholds

Example structure:

[eval]
description = "Test agent accuracy"
type = "accuracy"
targets.agents = ["customer-support"]
targets.tools = []

[[eval.cases]]
prompt = "What are your hours?"
output = "We are open {{*}} to {{*}}"

Built-in Security Eval Templates

Safety evals include pre-configured templates:

Prompt injection resistance - Standard jailbreaking attempts
SQL injection testing - Database security validation
Content filtering - Harmful content detection
PII exposure prevention - Leak detection

Execution Context

Evaluations run in two contexts:

CI Runs

Triggered on pull requests
Provide regression detection and baseline verification
Generate automated PR comments with results
Support branch protection rules

Runtime Runs

Capture live agent interactions via OpenTelemetry instrumentation
Analyze production behavior using the same evaluation framework
Support selective evaluation and percentage-based sampling

Version Control Integration

All eval configurations are stored in Git:

TOML configuration files in .agentci/evals/ directory
Custom eval code versioned with application code
Git commit hashes used as eval identifiers
Same evals run across development, staging, and production environments

Results and Analysis

Evaluation results provide:

Binary pass/fail outcomes for each test case
Detailed scoring and metrics per evaluation type
Performance trending over time
Attribution to specific commits and developers
Aggregated success rates across evaluation runs

Getting Started with Evals

Choose your eval type based on what aspect you want to test
Create a .agentci/evals/ directory in your repository root
Add TOML files for each eval (e.g., accuracy-test.toml)
Commit and open a pull request to see your evals run

Common Eval Questions

What are evals? Evals (evaluations) are automated tests for AI agents that measure accuracy, performance, safety, and consistency.

Why use evals for AI agents? Evals provide objective, reproducible measurements of agent behavior in both development and production environments.

What types of evals does Agent CI support? Agent CI supports accuracy evals, performance evals, safety evals, consistency evals, LLM evals, and custom evals.

Next Steps

For detailed configuration options and examples:

TOML Configuration Overview - Complete syntax reference
Accuracy Evaluations - Exact matching and validation
Performance Evaluations - Speed and resource testing
Safety Evaluations - Security and content filtering
Consistency Evaluations - Reliability across runs
LLM Evaluations - AI-powered quality assessment