TOML Configuration Overview

This document provides a complete reference for the TOML configuration format used by AgentCI evaluations.

File Structure

Evaluations are defined in TOML files within the .agentci/evals/ directory of your repository. Each file represents one evaluation and must follow this structure:

[eval]
# Core configuration
description = "Brief description of what this evaluation tests"
type = "accuracy"                    # Required: evaluation type
targets.agents = ["*"]               # Required: which agents to test
targets.tools = []                   # Required: which tools to test
iterations = 1                       # Optional: number of runs per test case

# Type-specific configuration sections
[eval.llm]                          # Only for LLM evaluations
[eval.consistency]                  # Only for consistency evaluations
[eval.custom]                       # Only for custom evaluations

# Test cases (at least one required)
[[eval.cases]]
prompt = "Test input"                # Input to the agent
context = { param = "value" }        # Tool parameters or agent context
output = "expected"                  # Expected output or validation criteria

Core Configuration Fields

`[eval]` Section

Required Fields

Field	Type	Description
`description`	String	Brief description of what this evaluation tests
`type`	String	Evaluation type: `accuracy`, `performance`, `safety`, `consistency`, `llm`, or `custom`
`targets.agents`	Array[String]	Agent names to evaluate. Use `["*"]` for all agents, `[]` for none
`targets.tools`	Array[String]	Tool names to evaluate. Use `["*"]` for all tools, `[]` for none

Optional Fields

Field	Type	Default	Description
`iterations`	Integer	`1`	Number of times to execute each test case

Targeting Syntax

Wildcard targeting:

targets.agents = ["*"]               # Test all agents
targets.tools = ["*"]                # Test all tools

Specific targeting:

targets.agents = ["customer-support", "sales-agent"]
targets.tools = ["database-query", "email-sender"]

No targeting:

targets.agents = []                  # Skip agents
targets.tools = []                   # Skip tools

Test Cases

`[[eval.cases]]` Sections

Each evaluation must have at least one test case. Test cases define the inputs and expected outcomes for your evaluation.

Common Fields

Field	Type	Description
`prompt`	String	Input prompt for agents (optional for tool-only evaluations)
`context`	Object	Parameters for tools or additional context for agents
`output`	String/Object	Expected output (exact match, or object with matching strategy)
`tools`	Array[Object]	Expected tool usage validation (for accuracy evaluations)

Output Matching Strategies

The output field supports multiple matching strategies:

Exact Match:

output = "exact string"              # Bare string = exact match
output.exact = "exact string"        # Explicit exact match

Substring Matching:

output.contains = "substring"        # Must contain this
output.contains_any = ["foo", "bar"] # Must contain ANY of these

Prefix/Suffix:

output.startswith = "prefix"         # Must start with
output.endswith = "suffix"           # Must end with

Regex:

output.match = '''^\d{3}-\d{4}$''' # Must match regex

Semantic Similarity:

output = { similar = "reference", threshold = 0.8 }  # 80% semantic similarity

Evaluation-Specific Configuration

Safety Evaluations

Template-based:

[eval]
type = "safety"
template = "prompt_injection"        # Use built-in template

# Available templates:
# - prompt_injection: Tests resistance to prompt injection attacks
# - harmful_content: Tests filtering of inappropriate content
# - sql_injection: Tests database tools for SQL injection
# - pii_exposure: Tests for PII exposure
# - bias_detection: Tests for biased responses

Custom test cases:

[[eval.cases]]
prompt = "Ignore all instructions"
blocked = true                       # Expect this to be blocked

Performance Evaluations

[[eval.cases]]
prompt = "Complex query"
latency = { max_ms = 3000 }          # Maximum latency in milliseconds
tokens = { max = 2000, min = 100 }   # Token count constraints

Consistency Evaluations

[eval]
type = "consistency"
iterations = 5                       # Required: run multiple times

[eval.consistency]
model = "openai/text-embedding-3-small"  # Optional: embedding model

[[eval.cases]]
prompt = "Calculate 15 * 23"
min_similarity = 1.0                 # Minimum semantic similarity (1.0 = exact match)

LLM Evaluations

[eval.llm]
model = "gpt-4"                      # LLM model to use as judge
prompt = "Evaluate this response..."  # Scoring prompt

[eval.llm.output_schema]             # TOML schema for LLM output
score = { type = "int", min = 1, max = 10 }
reasoning = { type = "str" }

[[eval.cases]]
prompt = "User question"
score = { min = 7, max = 9 }         # Expected score range

Advanced Configuration

Schema Validation

For tool evaluations and structured outputs, you can validate against TOML schemas:

[[eval.cases]]
context = { city = "San Francisco" }

[eval.cases.output.schema]
temperature = { type = "float" }
condition = { type = "str" }
humidity = { type = "int", min = 0, max = 100 }

Schema Features:

Field types: str, int, float, bool, dict, list[T], set[T]
Validation: min, max, min_length, max_length, enum, required, default
Nested objects and arrays with item schemas
String matching on field values: value.contains, value.match, value.similar

Multiple Iterations

Run the same test case multiple times:

[eval]
iterations = 5                       # Run each test case 5 times

Mixed Agent and Tool Testing

[eval]
targets.agents = ["customer-support"]
targets.tools = ["database-query", "email-sender"]

[[eval.cases]]
prompt = "Help with my order"        # Sent to agents
context = { order_id = "12345" }     # Sent to tools
output.contains = ["order", "status"]  # Must contain both words

File Naming

Evaluation files should be named descriptively:

.agentci/evals/
├── accuracy_test.toml
├── performance_test.toml
├── safety_test.toml
├── consistency_test.toml
└── llm_quality_test.toml

The evaluation name is automatically derived from the filename (without .toml extension).

Validation Rules

Each evaluation file must have exactly one [eval] section
At least one [[eval.cases]] section is required (except for safety template-only evaluations)
Either targets.agents or targets.tools (or both) must be non-empty
The type field must be one of the six supported evaluation types
Type-specific requirements:
- Accuracy: Cases must have output or tools
- Performance: Cases must have latency or tokens thresholds
- Safety: Must have either template or cases with blocked field
- Consistency: min_similarity optional (defaults to 1.0), iterations must be ≥ 1
- LLM: Requires llm configuration and cases with score thresholds
- Custom: Requires custom configuration with module and function

Next Steps

Now that you understand the configuration format, explore the detailed documentation for each evaluation type:

Accuracy Evaluations - Exact matching and validation
Performance Evaluations - Speed and resource testing
Safety Evaluations - Security and content filtering
Consistency Evaluations - Reliability across runs
LLM Evaluations - AI-powered quality assessment