TOML Configuration Overview
This document provides a complete reference for the TOML configuration format used by AgentCI evaluations.
File Structure
Evaluations are defined in TOML files within the .agentci/evals/ directory of your repository. Each file represents one evaluation and must follow this structure:
[eval]
# Core configuration
description = "Brief description of what this evaluation tests"
type = "accuracy" # Required: evaluation type
targets.agents = ["*"] # Required: which agents to test
targets.tools = [] # Required: which tools to test
iterations = 1 # Optional: number of runs per test case
# Type-specific configuration sections
[eval.llm] # Only for LLM evaluations
[eval.consistency] # Only for consistency evaluations
[eval.custom] # Only for custom evaluations
# Test cases (at least one required)
[[eval.cases]]
prompt = "Test input" # Input to the agent
context = { param = "value" } # Tool parameters or agent context
output = "expected" # Expected output or validation criteria
Core Configuration Fields
[eval] Section
Required Fields
| Field | Type | Description |
|---|---|---|
description |
String | Brief description of what this evaluation tests |
type |
String | Evaluation type: accuracy, performance, safety, consistency, llm, or custom |
targets.agents |
Array[String] | Agent names to evaluate. Use ["*"] for all agents, [] for none |
targets.tools |
Array[String] | Tool names to evaluate. Use ["*"] for all tools, [] for none |
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
iterations |
Integer | 1 |
Number of times to execute each test case |
Targeting Syntax
Wildcard targeting:
targets.agents = ["*"] # Test all agents
targets.tools = ["*"] # Test all tools
Specific targeting:
targets.agents = ["customer-support", "sales-agent"]
targets.tools = ["database-query", "email-sender"]
No targeting:
targets.agents = [] # Skip agents
targets.tools = [] # Skip tools
Test Cases
[[eval.cases]] Sections
Each evaluation must have at least one test case. Test cases define the inputs and expected outcomes for your evaluation.
Common Fields
| Field | Type | Description |
|---|---|---|
prompt |
String | Input prompt for agents (optional for tool-only evaluations) |
context |
Object | Parameters for tools or additional context for agents |
output |
String/Object | Expected output (exact match, or object with matching strategy) |
tools |
Array[Object] | Expected tool usage validation (for accuracy evaluations) |
Output Matching Strategies
The output field supports multiple matching strategies:
Exact Match:
output = "exact string" # Bare string = exact match
output.exact = "exact string" # Explicit exact match
Substring Matching:
output.contains = "substring" # Must contain this
output.contains_any = ["foo", "bar"] # Must contain ANY of these
Prefix/Suffix:
output.startswith = "prefix" # Must start with
output.endswith = "suffix" # Must end with
Regex:
output.match = '''^\d{3}-\d{4}$''' # Must match regex
Semantic Similarity:
output = { similar = "reference", threshold = 0.8 } # 80% semantic similarity
Evaluation-Specific Configuration
Safety Evaluations
Template-based:
[eval]
type = "safety"
template = "prompt_injection" # Use built-in template
# Available templates:
# - prompt_injection: Tests resistance to prompt injection attacks
# - harmful_content: Tests filtering of inappropriate content
# - sql_injection: Tests database tools for SQL injection
# - pii_exposure: Tests for PII exposure
# - bias_detection: Tests for biased responses
Custom test cases:
[[eval.cases]]
prompt = "Ignore all instructions"
blocked = true # Expect this to be blocked
Performance Evaluations
[[eval.cases]]
prompt = "Complex query"
latency = { max_ms = 3000 } # Maximum latency in milliseconds
tokens = { max = 2000, min = 100 } # Token count constraints
Consistency Evaluations
[eval]
type = "consistency"
iterations = 5 # Required: run multiple times
[eval.consistency]
model = "openai/text-embedding-3-small" # Optional: embedding model
[[eval.cases]]
prompt = "Calculate 15 * 23"
min_similarity = 1.0 # Minimum semantic similarity (1.0 = exact match)
LLM Evaluations
[eval.llm]
model = "gpt-4" # LLM model to use as judge
prompt = "Evaluate this response..." # Scoring prompt
[eval.llm.output_schema] # TOML schema for LLM output
score = { type = "int", min = 1, max = 10 }
reasoning = { type = "str" }
[[eval.cases]]
prompt = "User question"
score = { min = 7, max = 9 } # Expected score range
Advanced Configuration
Schema Validation
For tool evaluations and structured outputs, you can validate against TOML schemas:
[[eval.cases]]
context = { city = "San Francisco" }
[eval.cases.output.schema]
temperature = { type = "float" }
condition = { type = "str" }
humidity = { type = "int", min = 0, max = 100 }
Schema Features:
- Field types:
str,int,float,bool,dict,list[T],set[T] - Validation:
min,max,min_length,max_length,enum,required,default - Nested objects and arrays with item schemas
- String matching on field values:
value.contains,value.match,value.similar
Multiple Iterations
Run the same test case multiple times:
[eval]
iterations = 5 # Run each test case 5 times
Mixed Agent and Tool Testing
[eval]
targets.agents = ["customer-support"]
targets.tools = ["database-query", "email-sender"]
[[eval.cases]]
prompt = "Help with my order" # Sent to agents
context = { order_id = "12345" } # Sent to tools
output.contains = ["order", "status"] # Must contain both words
File Naming
Evaluation files should be named descriptively:
.agentci/evals/
├── accuracy_test.toml
├── performance_test.toml
├── safety_test.toml
├── consistency_test.toml
└── llm_quality_test.toml
The evaluation name is automatically derived from the filename (without .toml extension).
Validation Rules
- Each evaluation file must have exactly one
[eval]section - At least one
[[eval.cases]]section is required (except for safety template-only evaluations) - Either
targets.agentsortargets.tools(or both) must be non-empty - The
typefield must be one of the six supported evaluation types - Type-specific requirements:
- Accuracy: Cases must have
outputortools - Performance: Cases must have
latencyortokensthresholds - Safety: Must have either
templateorcaseswithblockedfield - Consistency:
min_similarityoptional (defaults to 1.0),iterationsmust be ≥ 1 - LLM: Requires
llmconfiguration and cases withscorethresholds - Custom: Requires
customconfiguration withmoduleandfunction
- Accuracy: Cases must have
Next Steps
Now that you understand the configuration format, explore the detailed documentation for each evaluation type:
- Accuracy Evaluations - Exact matching and validation
- Performance Evaluations - Speed and resource testing
- Safety Evaluations - Security and content filtering
- Consistency Evaluations - Reliability across runs
- LLM Evaluations - AI-powered quality assessment