Evaluations (Evals)

Evaluations (commonly called "evals") are automated tests that validate agent behavior and performance. Agent CI evals provide objective metrics for assessing agent functionality across different dimensions including accuracy, performance, safety, and consistency.

What are Evals?

Evals are automated testing frameworks designed specifically for AI agents and LLM applications. Whether you're running accuracy evals, performance evals, or safety evals, Agent CI provides comprehensive evaluation tools to ensure your agents behave as expected.

Overview

Agent CI evaluations run in two contexts:

  • CI runs: Execute on pull requests as part of the development workflow
  • Runtime runs: Analyze live agent interactions in production environments

Evaluation configurations are stored as TOML files in the .agentci/evals/ directory of your repository, keeping test definitions version-controlled alongside your code.

Evaluation Types

Agent CI provides six built-in eval types for comprehensive agent testing:

Accuracy Evals

Accuracy evals test whether agent outputs match expected results using exact matches, pattern matching, or JSON schema validation.

Performance Evals

Performance evals measure response time, latency, and resource usage with configurable thresholds.

Safety Evals

Safety evals validate content filtering and security resistance against prompt injection, harmful content, and other security risks. Include built-in templates for common attack vectors.

Consistency Evals

Consistency evals run identical inputs multiple times to verify low-variance outputs across executions.

LLM Evals

LLM evals use LLM-as-judge methodology with configurable scoring criteria for subjective quality measures.

Eval Configuration

Eval configurations use TOML format and are stored in the .agentci/evals/ directory. Each eval file defines:

  • Evaluation type and parameters
  • Target agents and tools to test
  • Test cases with inputs and expected outputs
  • Scoring criteria and thresholds

Example structure:

[eval]
description = "Test agent accuracy"
type = "accuracy"
targets.agents = ["customer-support"]
targets.tools = []

[[eval.cases]]
prompt = "What are your hours?"
output = "We are open {{*}} to {{*}}"

Built-in Security Eval Templates

Safety evals include pre-configured templates:

  • Prompt injection resistance - Standard jailbreaking attempts
  • SQL injection testing - Database security validation
  • Content filtering - Harmful content detection
  • PII exposure prevention - Leak detection

Execution Context

Evaluations run in two contexts:

CI Runs

  • Triggered on pull requests
  • Provide regression detection and baseline verification
  • Generate automated PR comments with results
  • Support branch protection rules

Runtime Runs

  • Capture live agent interactions via OpenTelemetry instrumentation
  • Analyze production behavior using the same evaluation framework
  • Support selective evaluation and percentage-based sampling

Version Control Integration

All eval configurations are stored in Git:

  • TOML configuration files in .agentci/evals/ directory
  • Custom eval code versioned with application code
  • Git commit hashes used as eval identifiers
  • Same evals run across development, staging, and production environments

Results and Analysis

Evaluation results provide:

  • Binary pass/fail outcomes for each test case
  • Detailed scoring and metrics per evaluation type
  • Performance trending over time
  • Attribution to specific commits and developers
  • Aggregated success rates across evaluation runs

Getting Started with Evals

  1. Choose your eval type based on what aspect you want to test
  2. Create a .agentci/evals/ directory in your repository root
  3. Add TOML files for each eval (e.g., accuracy-test.toml)
  4. Commit and open a pull request to see your evals run

Common Eval Questions

What are evals? Evals (evaluations) are automated tests for AI agents that measure accuracy, performance, safety, and consistency.

Why use evals for AI agents? Evals provide objective, reproducible measurements of agent behavior in both development and production environments.

What types of evals does Agent CI support? Agent CI supports accuracy evals, performance evals, safety evals, consistency evals, LLM evals, and custom evals.

Next Steps

For detailed configuration options and examples: