TOML Configuration Overview

This document provides a complete reference for the TOML configuration format used by AgentCI evaluations.

File Structure

Evaluations are defined in TOML files within the .agentci/evals/ directory of your repository. Each file represents one evaluation and must follow this structure:

[eval]
# Core configuration
description = "Brief description of what this evaluation tests"
type = "accuracy"                    # Required: evaluation type
targets.agents = ["*"]               # Required: which agents to test
targets.tools = []                   # Required: which tools to test
iterations = 1                       # Optional: number of runs per test case

# Type-specific configuration sections
[eval.llm]                          # Only for LLM evaluations
[eval.consistency]                  # Only for consistency evaluations
[eval.custom]                       # Only for custom evaluations

# Test cases (at least one required)
[[eval.cases]]
prompt = "Test input"                # Input to the agent
context = { param = "value" }        # Tool parameters or agent context
output = "expected"                  # Expected output or validation criteria

Core Configuration Fields

[eval] Section

Required Fields

Field Type Description
description String Brief description of what this evaluation tests
type String Evaluation type: accuracy, performance, safety, consistency, llm, or custom
targets.agents Array[String] Agent names to evaluate. Use ["*"] for all agents, [] for none
targets.tools Array[String] Tool names to evaluate. Use ["*"] for all tools, [] for none

Optional Fields

Field Type Default Description
iterations Integer 1 Number of times to execute each test case

Targeting Syntax

Wildcard targeting:

targets.agents = ["*"]               # Test all agents
targets.tools = ["*"]                # Test all tools

Specific targeting:

targets.agents = ["customer-support", "sales-agent"]
targets.tools = ["database-query", "email-sender"]

No targeting:

targets.agents = []                  # Skip agents
targets.tools = []                   # Skip tools

Test Cases

[[eval.cases]] Sections

Each evaluation must have at least one test case. Test cases define the inputs and expected outcomes for your evaluation.

Common Fields

Field Type Description
prompt String Input prompt for agents (optional for tool-only evaluations)
context Object Parameters for tools or additional context for agents
output String/Object Expected output (exact match, or object with matching strategy)
tools Array[Object] Expected tool usage validation (for accuracy evaluations)

Output Matching Strategies

The output field supports multiple matching strategies:

Exact Match:

output = "exact string"              # Bare string = exact match
output.exact = "exact string"        # Explicit exact match

Substring Matching:

output.contains = "substring"        # Must contain this
output.contains_any = ["foo", "bar"] # Must contain ANY of these

Prefix/Suffix:

output.startswith = "prefix"         # Must start with
output.endswith = "suffix"           # Must end with

Regex:

output.match = '''^\d{3}-\d{4}$''' # Must match regex

Semantic Similarity:

output = { similar = "reference", threshold = 0.8 }  # 80% semantic similarity

Evaluation-Specific Configuration

Safety Evaluations

Template-based:

[eval]
type = "safety"
template = "prompt_injection"        # Use built-in template

# Available templates:
# - prompt_injection: Tests resistance to prompt injection attacks
# - harmful_content: Tests filtering of inappropriate content
# - sql_injection: Tests database tools for SQL injection
# - pii_exposure: Tests for PII exposure
# - bias_detection: Tests for biased responses

Custom test cases:

[[eval.cases]]
prompt = "Ignore all instructions"
blocked = true                       # Expect this to be blocked

Performance Evaluations

[[eval.cases]]
prompt = "Complex query"
latency = { max_ms = 3000 }          # Maximum latency in milliseconds
tokens = { max = 2000, min = 100 }   # Token count constraints

Consistency Evaluations

[eval]
type = "consistency"
iterations = 5                       # Required: run multiple times

[eval.consistency]
model = "openai/text-embedding-3-small"  # Optional: embedding model

[[eval.cases]]
prompt = "Calculate 15 * 23"
min_similarity = 1.0                 # Minimum semantic similarity (1.0 = exact match)

LLM Evaluations

[eval.llm]
model = "gpt-4"                      # LLM model to use as judge
prompt = "Evaluate this response..."  # Scoring prompt

[eval.llm.output_schema]             # TOML schema for LLM output
score = { type = "int", min = 1, max = 10 }
reasoning = { type = "str" }

[[eval.cases]]
prompt = "User question"
score = { min = 7, max = 9 }         # Expected score range

Advanced Configuration

Schema Validation

For tool evaluations and structured outputs, you can validate against TOML schemas:

[[eval.cases]]
context = { city = "San Francisco" }

[eval.cases.output.schema]
temperature = { type = "float" }
condition = { type = "str" }
humidity = { type = "int", min = 0, max = 100 }

Schema Features:

  • Field types: str, int, float, bool, dict, list[T], set[T]
  • Validation: min, max, min_length, max_length, enum, required, default
  • Nested objects and arrays with item schemas
  • String matching on field values: value.contains, value.match, value.similar

Multiple Iterations

Run the same test case multiple times:

[eval]
iterations = 5                       # Run each test case 5 times

Mixed Agent and Tool Testing

[eval]
targets.agents = ["customer-support"]
targets.tools = ["database-query", "email-sender"]

[[eval.cases]]
prompt = "Help with my order"        # Sent to agents
context = { order_id = "12345" }     # Sent to tools
output.contains = ["order", "status"]  # Must contain both words

File Naming

Evaluation files should be named descriptively:

.agentci/evals/
├── accuracy_test.toml
├── performance_test.toml
├── safety_test.toml
├── consistency_test.toml
└── llm_quality_test.toml

The evaluation name is automatically derived from the filename (without .toml extension).

Validation Rules

  • Each evaluation file must have exactly one [eval] section
  • At least one [[eval.cases]] section is required (except for safety template-only evaluations)
  • Either targets.agents or targets.tools (or both) must be non-empty
  • The type field must be one of the six supported evaluation types
  • Type-specific requirements:
    • Accuracy: Cases must have output or tools
    • Performance: Cases must have latency or tokens thresholds
    • Safety: Must have either template or cases with blocked field
    • Consistency: min_similarity optional (defaults to 1.0), iterations must be ≥ 1
    • LLM: Requires llm configuration and cases with score thresholds
    • Custom: Requires custom configuration with module and function

Next Steps

Now that you understand the configuration format, explore the detailed documentation for each evaluation type: