# Performance Evaluations

Performance evaluations measure response latency, token usage, and resource consumption against configurable thresholds. They're critical for ensuring your agents meet production performance requirements.
## What Performance Evaluations Test
Performance evaluations measure computational efficiency and speed:
- Response latency - How long agents take to respond
- Token usage - Input and output token consumption
- Resource constraints - Memory and processing limits
- Throughput - Requests processed per unit time
## When to Use Performance Evaluations
Use performance evaluations to ensure production readiness:
- ✅ Production deployment - Verify agents meet SLA requirements
- ✅ Cost optimization - Monitor token usage and resource consumption
- ✅ User experience - Ensure acceptable response times
- ✅ Load testing - Validate performance under concurrent requests
- ✅ Regression detection - Catch performance degradations early
- ❌ Functional correctness - Use accuracy evaluations instead
- ❌ Content quality - Use LLM or semantic evaluations instead
## Configuration

### Basic Configuration

```toml
[eval]
description = "Test response performance"
type = "performance"
targets.agents = ["*"]  # Test all agents
targets.tools = []      # Skip tools
```
### Supported Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `description` | String | Yes | Brief description of the test |
| `type` | String | Yes | Must be `"performance"` |
| `targets.agents` | Array[String] | Yes | Agents to test (`["*"]` for all) |
| `targets.tools` | Array[String] | Yes | Tools to test (`["*"]` for all) |
| `iterations` | Integer | No | Times to run each test case (default: 1) |
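Putting the fields together, here is a sketch that uses every supported option, including the optional `iterations` count (values are illustrative):

```toml
[eval]
description = "Latency and token check for all agents"
type = "performance"
targets.agents = ["*"]  # test every agent
targets.tools = []      # skip tools
iterations = 5          # optional; run each case 5 times (default: 1)
```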
## Test Cases

### Basic Test Case Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `prompt` | String | No* | Input prompt for agents |
| `context` | Object | No* | Parameters for tools or agent context |
| `latency` | Object | No | Latency constraints and thresholds |
| `tokens` | Object | No | Token usage constraints |

*Either `prompt` or `context` (or both) must be provided.
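As a sketch, agent cases typically supply `prompt` while tool cases supply `context`; either kind can attach `latency` and `tokens` constraints:

```toml
# Agent case: input is a prompt
[[eval.cases]]
prompt = "What is 2 + 2?"
latency = { max_ms = 1000 }
tokens = { max = 50 }

# Tool case: input is a context object
[[eval.cases]]
context = { city = "San Francisco" }
latency = { max_ms = 3000 }
```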
### Latency Configuration

The `latency` object supports these fields:

| Field | Type | Unit | Description |
|---|---|---|---|
| `max` | Number | seconds | Maximum allowed response time |
| `max_ms` | Number | milliseconds | Maximum allowed response time |
| `min` | Number | seconds | Minimum expected response time |
| `min_ms` | Number | milliseconds | Minimum expected response time |
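For example, a single case can bound response time from both sides, mixing units as convenient:

```toml
[[eval.cases]]
prompt = "Standard query"
latency = { min_ms = 100, max = 5 }  # at least 100 ms, at most 5 seconds
```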
### Token Configuration

The `tokens` object supports these fields:

| Field | Type | Description |
|---|---|---|
| `max` | Number | Maximum total tokens (input + output) |
| `min` | Number | Minimum total tokens (input + output) |
| `input_max` | Number | Maximum input tokens |
| `input_min` | Number | Minimum input tokens |
| `output_max` | Number | Maximum output tokens |
| `output_min` | Number | Minimum output tokens |
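A sketch that constrains input and output budgets independently while also capping the combined total (the limits are illustrative):

```toml
[[eval.cases]]
prompt = "Explain quantum physics in simple terms"
tokens = { max = 1000, input_max = 50, output_min = 100, output_max = 500 }
```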
## Examples
### Basic Latency Testing

```toml
[eval]
description = "Test basic response times"
type = "performance"
targets.agents = ["*"]
targets.tools = []

# Simple questions should be fast
[[eval.cases]]
prompt = "What is 2 + 2?"
latency = { max_ms = 1000 }  # Must respond within 1 second

# Complex queries can take longer
[[eval.cases]]
prompt = "Analyze this quarterly report and provide insights"
latency = { max = 30 }  # Allow up to 30 seconds
```
### Token Usage Monitoring

```toml
[eval]
description = "Monitor token consumption"
type = "performance"
targets.agents = ["chat-assistant"]
targets.tools = []

# Short responses for simple questions
[[eval.cases]]
prompt = "Say hello"
tokens = { max = 50, output_max = 10 }

# Reasonable limits for complex tasks
[[eval.cases]]
prompt = "Explain quantum physics in simple terms"
tokens = { max = 1000, output_min = 100, output_max = 500 }
```
### Combined Performance Testing

```toml
[eval]
description = "Comprehensive performance validation"
type = "performance"
targets.agents = ["customer-support"]
targets.tools = []

# Fast, efficient responses for common questions
[[eval.cases]]
prompt = "What are your business hours?"
latency = { max_ms = 2000 }
tokens = { max = 100 }

# Reasonable performance for complex issues
[[eval.cases]]
prompt = "I'm having trouble with my account setup. Can you help me troubleshoot?"
latency = { max = 15 }
tokens = { max = 800, min = 200 }
```
### Tool Performance Testing

```toml
[eval]
description = "Test API tool performance"
type = "performance"
targets.agents = []
targets.tools = ["weather-api", "database-query"]

# API calls should be fast
[[eval.cases]]
context = { city = "San Francisco" }
latency = { max_ms = 3000 }  # API timeout

# Database queries should be efficient
[[eval.cases]]
context = { query = "SELECT * FROM users LIMIT 10" }
latency = { max_ms = 500 }
```
### Load Testing Simulation

```toml
[eval]
description = "Simulate concurrent load"
type = "performance"
targets.agents = ["api-assistant"]
targets.tools = []
iterations = 10  # Run each test 10 times

[[eval.cases]]
prompt = "Process this data"
latency = { max = 5 }
tokens = { max = 500 }
```
### Performance Regression Testing

```toml
[eval]
description = "Detect performance regressions"
type = "performance"
targets.agents = ["*"]
targets.tools = []

# Baseline performance expectations
[[eval.cases]]
prompt = "Simple greeting"
latency = { max_ms = 800 }  # Should be very fast

[[eval.cases]]
prompt = "Medium complexity task"
latency = { max = 10 }  # Reasonable response time

[[eval.cases]]
prompt = "Complex analysis request"
latency = { max = 45 }  # Allow longer for complex tasks
```
## Best Practices

### Setting Realistic Thresholds

Baseline first, then optimize:

```toml
# Start with generous limits
[[eval.cases]]
prompt = "Test query"
latency = { max = 60 }

# Tighten limits as you optimize
latency = { max = 30 }
latency = { max = 15 }
```

Account for external dependencies:

```toml
# API calls need higher latency allowances
[[eval.cases]]
context = { external_api_call = true }
latency = { max = 10 }  # Account for network latency

# Local processing can be faster
[[eval.cases]]
prompt = "Local calculation"
latency = { max_ms = 2000 }
```
### Token Optimization

Monitor input efficiency:

```toml
[[eval.cases]]
prompt = "Brief question"
tokens = { input_max = 20, output_max = 50 }
```

Prevent runaway outputs:

```toml
[[eval.cases]]
prompt = "Explain this concept"
tokens = { output_max = 300 }  # Prevent excessive responses
```
### Performance Testing Strategy

Test different complexity levels:

```toml
# Simple tasks
[[eval.cases]]
prompt = "What is 1+1?"
latency = { max_ms = 500 }

# Medium tasks
[[eval.cases]]
prompt = "Summarize this paragraph"
latency = { max = 5 }

# Complex tasks
[[eval.cases]]
prompt = "Analyze and provide detailed recommendations"
latency = { max = 30 }
```

Use iterations for consistency:

```toml
[eval]
iterations = 5  # Run multiple times for average

[[eval.cases]]
prompt = "Performance test"
latency = { max = 10 }
```
### Production Readiness Checklist
Use performance evaluations to verify:
- ✅ 95th percentile response time under acceptable limits
- ✅ Token costs within budget constraints
- ✅ No timeout failures under normal load
- ✅ Consistent performance across multiple runs
- ✅ Resource usage stays within limits
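A sketch that exercises several of these checks at once: repeated runs for consistency, a latency ceiling, and a token budget (the thresholds are illustrative):

```toml
[eval]
description = "Production readiness gate"
type = "performance"
targets.agents = ["customer-support"]
targets.tools = []
iterations = 20  # repeated runs surface inconsistent performance

[[eval.cases]]
prompt = "What are your business hours?"
latency = { max_ms = 2000 }  # response-time requirement
tokens = { max = 100 }       # cost ceiling
```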
## Advanced Configuration

### Time Unit Flexibility

```toml
# Use seconds for longer operations
latency = { max = 30 }

# Use milliseconds for precise timing
latency = { max_ms = 1500 }

# Mix units as needed
latency = { min_ms = 100, max = 5 }
```
### Range Testing

```toml
# Ensure responses aren't too fast (may indicate errors)
# or too slow (poor user experience)
[[eval.cases]]
prompt = "Standard query"
latency = { min_ms = 200, max_ms = 3000 }
tokens = { min = 10, max = 200 }
```
### Multiple Iterations for Statistics

```toml
[eval]
description = "Statistical performance analysis"
type = "performance"
iterations = 20  # Run 20 times for statistics

[[eval.cases]]
prompt = "Benchmark query"
latency = { max = 5 }  # Average should be under 5 seconds
```
## Troubleshooting

### Common Issues

Intermittent failures:

```toml
# Use higher iteration counts to catch inconsistent performance
[eval]
iterations = 10

[[eval.cases]]
prompt = "Test case"
latency = { max = 15 }
```
Cold start delays:

```toml
# Account for cold-start latency in serverless environments
[[eval.cases]]
prompt = "First request"
latency = { max = 20 }  # Allow extra time for cold starts
```
Token counting discrepancies:

```toml
# Be generous with token limits during initial testing
[[eval.cases]]
prompt = "Test response"
tokens = { max = 1000 }  # Start high, optimize down
```
### Performance Debugging

Start with broad limits:

```toml
# Begin with generous thresholds
latency = { max = 60 }
tokens = { max = 2000 }
```

Gradually tighten constraints:

```toml
# Iteratively reduce limits
latency = { max = 30 }
latency = { max = 15 }
latency = { max = 10 }
```

Use multiple test cases for different scenarios:

```toml
# Test various input complexities
[[eval.cases]]
prompt = "Simple"
latency = { max = 2 }

[[eval.cases]]
prompt = "Complex analysis with multiple steps"
latency = { max = 20 }
```
## Interpreting Results
Performance evaluation results include:
- Average latency across all iterations
- 95th percentile latency for reliability assessment
- Token usage statistics for cost analysis
- Success/failure rates for reliability metrics
- Performance trends over time
## Next Steps
- Safety Evaluations - Test security and content filtering
- Consistency Evaluations - Verify reliable behavior
- Configuration Overview - Complete TOML reference