# Performance Evaluations

Performance evaluations measure response latency, token usage, and resource consumption against configurable thresholds. They're critical for ensuring your agents meet production performance requirements.
## What Performance Evaluations Test
Performance evaluations measure computational efficiency and speed:
- Response latency - How long agents take to respond
- Token usage - Input and output token consumption
- Resource constraints - Memory and processing limits
- Throughput - Requests processed per unit time
## When to Use Performance Evaluations
Use performance evaluations to ensure production readiness:
- ✅ Production deployment - Verify agents meet SLA requirements
- ✅ Cost optimization - Monitor token usage and resource consumption
- ✅ User experience - Ensure acceptable response times
- ✅ Load testing - Validate performance under concurrent requests
- ✅ Regression detection - Catch performance degradations early
- ❌ Functional correctness - Use accuracy evaluations instead
- ❌ Content quality - Use LLM or semantic evaluations instead
## Configuration

### Basic Configuration

```toml
[eval]
description = "Test response performance"
type = "performance"
targets.agents = ["*"]  # Test all agents
targets.tools = []      # Skip tools
```
### Supported Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `description` | String | Yes | Brief description of the test |
| `type` | String | Yes | Must be `"performance"` |
| `targets.agents` | Array[String] | Yes | Agents to test (`["*"]` for all) |
| `targets.tools` | Array[String] | Yes | Tools to test (`["*"]` for all) |
| `iterations` | Integer | No | Times to run each test case (default: 1) |
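Putting the fields together, here is a sketch that uses every supported option, including the optional `iterations` count (values are illustrative):

```toml
[eval]
description = "Latency and token check for all agents"
type = "performance"
targets.agents = ["*"]  # test every agent
targets.tools = []      # skip tools
iterations = 5          # optional; run each case 5 times (default: 1)
```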
## Test Cases

### Basic Test Case Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `prompt` | String | No* | Input prompt for agents |
| `context` | Object | No* | Parameters for tools or agent context |
| `latency` | Object | No | Latency constraints and thresholds |
| `tokens` | Object | No | Token usage constraints |

*Either `prompt` or `context` (or both) must be provided.
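As a sketch, agent cases typically supply `prompt` while tool cases supply `context`; either kind can attach `latency` and `tokens` constraints:

```toml
# Agent case: input is a prompt
[[eval.cases]]
prompt = "What is 2 + 2?"
latency = { max_ms = 1000 }
tokens = { max = 50 }

# Tool case: input is a context object
[[eval.cases]]
context = { city = "San Francisco" }
latency = { max_ms = 3000 }
```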
### Latency Configuration

The `latency` object supports these fields:

| Field | Type | Unit | Description |
|---|---|---|---|
| `max` | Number | seconds | Maximum allowed response time |
| `max_ms` | Number | milliseconds | Maximum allowed response time |
| `min` | Number | seconds | Minimum expected response time |
| `min_ms` | Number | milliseconds | Minimum expected response time |
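For example, a single case can bound response time from both sides, mixing units as convenient:

```toml
[[eval.cases]]
prompt = "Standard query"
latency = { min_ms = 100, max = 5 }  # at least 100 ms, at most 5 seconds
```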
### Token Configuration

The `tokens` object supports these fields:

| Field | Type | Description |
|---|---|---|
| `max` | Number | Maximum total tokens (input + output) |
| `min` | Number | Minimum total tokens (input + output) |
| `input_max` | Number | Maximum input tokens |
| `input_min` | Number | Minimum input tokens |
| `output_max` | Number | Maximum output tokens |
| `output_min` | Number | Minimum output tokens |
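A sketch that constrains input and output budgets independently while also capping the combined total (the limits are illustrative):

```toml
[[eval.cases]]
prompt = "Explain quantum physics in simple terms"
tokens = { max = 1000, input_max = 50, output_min = 100, output_max = 500 }
```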
## Examples
### Basic Latency Testing

```toml
[eval]
description = "Test basic response times"
type = "performance"
targets.agents = ["*"]
targets.tools = []

# Simple questions should be fast
[[eval.cases]]
prompt = "What is 2 + 2?"
latency = { max_ms = 1000 }  # Must respond within 1 second

# Complex queries can take longer
[[eval.cases]]
prompt = "Analyze this quarterly report and provide insights"
latency = { max = 30 }  # Allow up to 30 seconds
```
### Token Usage Monitoring

```toml
[eval]
description = "Monitor token consumption"
type = "performance"
targets.agents = ["chat-assistant"]
targets.tools = []

# Short responses for simple questions
[[eval.cases]]
prompt = "Say hello"
tokens = { max = 50, output_max = 10 }

# Reasonable limits for complex tasks
[[eval.cases]]
prompt = "Explain quantum physics in simple terms"
tokens = { max = 1000, output_min = 100, output_max = 500 }
```
### Combined Performance Testing

```toml
[eval]
description = "Comprehensive performance validation"
type = "performance"
targets.agents = ["customer-support"]
targets.tools = []

# Fast, efficient responses for common questions
[[eval.cases]]
prompt = "What are your business hours?"
latency = { max_ms = 2000 }
tokens = { max = 100 }

# Reasonable performance for complex issues
[[eval.cases]]
prompt = "I'm having trouble with my account setup. Can you help me troubleshoot?"
latency = { max = 15 }
tokens = { max = 800, min = 200 }
```
### Tool Performance Testing

```toml
[eval]
description = "Test API tool performance"
type = "performance"
targets.agents = []
targets.tools = ["weather-api", "database-query"]

# API calls should be fast
[[eval.cases]]
context = { city = "San Francisco" }
latency = { max_ms = 3000 }  # API timeout

# Database queries should be efficient
[[eval.cases]]
context = { query = "SELECT * FROM users LIMIT 10" }
latency = { max_ms = 500 }
```
### Load Testing Simulation

```toml
[eval]
description = "Simulate concurrent load"
type = "performance"
targets.agents = ["api-assistant"]
targets.tools = []
iterations = 10  # Run each test 10 times

[[eval.cases]]
prompt = "Process this data"
latency = { max = 5 }
tokens = { max = 500 }
```
### Performance Regression Testing

```toml
[eval]
description = "Detect performance regressions"
type = "performance"
targets.agents = ["*"]
targets.tools = []

# Baseline performance expectations
[[eval.cases]]
prompt = "Simple greeting"
latency = { max_ms = 800 }  # Should be very fast

[[eval.cases]]
prompt = "Medium complexity task"
latency = { max = 10 }  # Reasonable response time

[[eval.cases]]
prompt = "Complex analysis request"
latency = { max = 45 }  # Allow longer for complex tasks
```
## Best Practices

### Setting Realistic Thresholds

Baseline first, then optimize:

```toml
# Start with generous limits
[[eval.cases]]
prompt = "Test query"
latency = { max = 60 }

# Tighten limits as you optimize
latency = { max = 30 }
latency = { max = 15 }
```

Account for external dependencies:

```toml
# API calls need higher latency allowances
[[eval.cases]]
context = { external_api_call = true }
latency = { max = 10 }  # Account for network latency

# Local processing can be faster
[[eval.cases]]
prompt = "Local calculation"
latency = { max_ms = 2000 }
```
### Token Optimization

Monitor input efficiency:

```toml
[[eval.cases]]
prompt = "Brief question"
tokens = { input_max = 20, output_max = 50 }
```

Prevent runaway outputs:

```toml
[[eval.cases]]
prompt = "Explain this concept"
tokens = { output_max = 300 }  # Prevent excessive responses
```
### Performance Testing Strategy

Test different complexity levels:

```toml
# Simple tasks
[[eval.cases]]
prompt = "What is 1+1?"
latency = { max_ms = 500 }

# Medium tasks
[[eval.cases]]
prompt = "Summarize this paragraph"
latency = { max = 5 }

# Complex tasks
[[eval.cases]]
prompt = "Analyze and provide detailed recommendations"
latency = { max = 30 }
```

Use iterations for consistency:

```toml
[eval]
iterations = 5  # Run multiple times for average

[[eval.cases]]
prompt = "Performance test"
latency = { max = 10 }
```
### Production Readiness Checklist
Use performance evaluations to verify:
- ✅ 95th percentile response time under acceptable limits
- ✅ Token costs within budget constraints
- ✅ No timeout failures under normal load
- ✅ Consistent performance across multiple runs
- ✅ Resource usage stays within limits
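A sketch that exercises several of these checks at once: repeated runs for consistency, a latency ceiling, and a token budget (the thresholds are illustrative):

```toml
[eval]
description = "Production readiness gate"
type = "performance"
targets.agents = ["customer-support"]
targets.tools = []
iterations = 20  # repeated runs surface inconsistent performance

[[eval.cases]]
prompt = "What are your business hours?"
latency = { max_ms = 2000 }  # response-time requirement
tokens = { max = 100 }       # cost ceiling
```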
## Advanced Configuration

### Time Unit Flexibility

```toml
# Use seconds for longer operations
latency = { max = 30 }

# Use milliseconds for precise timing
latency = { max_ms = 1500 }

# Mix units as needed
latency = { min_ms = 100, max = 5 }
```
### Range Testing

```toml
# Ensure responses aren't too fast (may indicate errors)
# or too slow (poor user experience)
[[eval.cases]]
prompt = "Standard query"
latency = { min_ms = 200, max_ms = 3000 }
tokens = { min = 10, max = 200 }
```
### Multiple Iterations for Statistics

```toml
[eval]
description = "Statistical performance analysis"
type = "performance"
iterations = 20  # Run 20 times for statistics

[[eval.cases]]
prompt = "Benchmark query"
latency = { max = 5 }  # Average should be under 5 seconds
```
## Troubleshooting

### Common Issues

Intermittent failures:

```toml
# Use higher iteration counts to catch inconsistent performance
[eval]
iterations = 10

[[eval.cases]]
prompt = "Test case"
latency = { max = 15 }
```
Cold start delays:

```toml
# Account for cold-start latency in serverless environments
[[eval.cases]]
prompt = "First request"
latency = { max = 20 }  # Allow extra time for cold starts
```
Token counting discrepancies:

```toml
# Be generous with token limits during initial testing
[[eval.cases]]
prompt = "Test response"
tokens = { max = 1000 }  # Start high, optimize down
```
### Performance Debugging

Start with broad limits:

```toml
# Begin with generous thresholds
latency = { max = 60 }
tokens = { max = 2000 }
```

Gradually tighten constraints:

```toml
# Iteratively reduce limits
latency = { max = 30 }
latency = { max = 15 }
latency = { max = 10 }
```

Use multiple test cases for different scenarios:

```toml
# Test various input complexities
[[eval.cases]]
prompt = "Simple"
latency = { max = 2 }

[[eval.cases]]
prompt = "Complex analysis with multiple steps"
latency = { max = 20 }
```
## Interpreting Results
Performance evaluation results include:
- Average latency across all iterations
- 95th percentile latency for reliability assessment
- Token usage statistics for cost analysis
- Success/failure rates for reliability metrics
- Performance trends over time
## Next Steps
- Safety Evaluations - Test security and content filtering
- Consistency Evaluations - Verify reliable behavior
- Configuration Overview - Complete TOML reference