# LLM Evaluations
LLM evaluations use an LLM-as-judge methodology with configurable scoring criteria to assess response quality. They're ideal for subjective quality measures such as helpfulness, clarity, and appropriateness that can't easily be measured with rules-based approaches.
## What LLM Evaluations Test
LLM evaluations leverage AI models to assess subjective quality dimensions:
- Response quality - Overall helpfulness and usefulness of responses
- Clarity and coherence - How well responses communicate information
- Appropriateness - Whether responses are suitable for the context
- Completeness - How thoroughly responses address user needs
- Professional tone - Maintaining appropriate communication style
- Factual accuracy - Correctness of information provided
- User satisfaction - Predicted user satisfaction with responses
## When to Use LLM Evaluations
Use LLM evaluations for subjective quality assessment:
- ✅ Content quality - Assess overall response quality and helpfulness
- ✅ Subjective criteria - Evaluate aspects that require judgment
- ✅ User experience - Predict user satisfaction with responses
- ✅ Complex reasoning - Evaluate multi-step problem-solving
- ✅ Creative content - Assess creativity, originality, and engagement
- ✅ Professional communication - Evaluate tone, style, and appropriateness
- ❌ Exact matching - Use accuracy evaluations instead
- ❌ Performance metrics - Use performance evaluations instead
- ❌ Deterministic checks - Use accuracy or consistency evaluations
## Configuration

### Basic Configuration

```toml
[eval]
description = "LLM evaluation of response quality"
type = "llm"
targets.agents = ["*"]  # Test all agents
targets.tools = []      # Skip tools

[eval.llm]
model = "gpt-4"  # LLM model to use as judge
prompt = "Evaluate the helpfulness and accuracy of this response."
```
### Supported Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `description` | String | Yes | Brief description of the test |
| `type` | String | Yes | Must be `"llm"` |
| `targets.agents` | Array[String] | Yes | Agents to test (`["*"]` for all) |
| `targets.tools` | Array[String] | Yes | Tools to test (`["*"]` for all) |
### LLM Configuration (`[eval.llm]`)

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | String | Yes | LLM model to use as judge |
| `prompt` | String | Yes | Evaluation prompt for the LLM judge |
| `output_schema` | Object | No | TOML schema for structured LLM output |
| `temperature` | Number | No | Model temperature (default: 0.1 for consistency) |
| `max_tokens` | Number | No | Maximum tokens for the LLM response |
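The optional fields (`output_schema`, `temperature`, `max_tokens`) are not exercised together in the examples below, so here is a minimal sketch of how they can be combined in one block; the specific values are illustrative, not recommendations.

```toml
[eval.llm]
model = "gpt-4-turbo"
prompt = "Evaluate the helpfulness and accuracy of this response (1-10)."
temperature = 0.0  # override the 0.1 default for maximally repeatable judging
max_tokens = 300   # cap the length of the judge's explanation

# Structured output keeps the score machine-readable
[eval.llm.output_schema]
score = { type = "int", min = 1, max = 10 }
reasoning = { type = "str" }
```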
### Available Judge Models

| Model | Best For | Cost | Speed |
|---|---|---|---|
| `gpt-4` | High-quality evaluation | High | Slow |
| `gpt-4-turbo` | Balanced quality and speed | Medium | Medium |
| `gpt-3.5-turbo` | Fast, cost-effective evaluation | Low | Fast |
| `claude-3-opus` | Detailed, nuanced evaluation | High | Medium |
| `claude-3-sonnet` | Balanced evaluation | Medium | Fast |
## Test Cases

### Basic Test Case Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `prompt` | String | No* | Input prompt for agents |
| `context` | Object | No* | Parameters for tools or agent context (see the sketch below) |
| `score` | Object | Yes | Expected score constraints |
| `criteria` | Array[String] | No | Specific evaluation criteria |
| `reference_response` | String | No | Optional reference for comparison |

*Either `prompt` or `context` (or both) must be provided.
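All of the examples on this page target agents via `prompt`; `context` is used when a case targets a tool instead. A minimal sketch follows, where the tool name `order-lookup` and its parameter keys are hypothetical and would need to match your tool's actual signature.

```toml
[eval]
description = "Judge the clarity of tool output"
type = "llm"
targets.agents = []
targets.tools = ["order-lookup"]  # hypothetical tool name

[eval.llm]
model = "gpt-4"
prompt = "Evaluate whether this tool output is complete and clearly structured (1-10)."

[[eval.cases]]
context = { order_id = "A-1001", include_history = true }  # hypothetical tool parameters
score = { min = 7 }
```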
### Score Constraints

| Constraint | Description | Example |
|---|---|---|
| `min` | Minimum acceptable score | `{ min = 7 }` |
| `max` | Maximum acceptable score | `{ max = 9 }` |
| `equal` | Exact score required | `{ equal = 10 }` |
| `min` + `max` | Score within a range | `{ min = 6, max = 8 }` |
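Inside `[[eval.cases]]` entries, those constraints look like this (the prompts are placeholders):

```toml
[[eval.cases]]
prompt = "Routine question"
score = { min = 7 }           # must score at least 7

[[eval.cases]]
prompt = "Question where an inflated score should fail"
score = { max = 9 }           # must not exceed 9

[[eval.cases]]
prompt = "Question with a single acceptable outcome"
score = { equal = 10 }        # must score exactly 10

[[eval.cases]]
prompt = "Question where a middling score is expected"
score = { min = 6, max = 8 }  # must fall within 6-8
```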
## Examples

### Basic Quality Assessment

```toml
[eval]
description = "Evaluate general response quality"
type = "llm"
targets.agents = ["customer-support"]
targets.tools = []

[eval.llm]
model = "gpt-4"
prompt = """
Evaluate this response on a scale of 1-10 considering:
- Helpfulness: How well does it address the user's question?
- Accuracy: Is the information provided correct?
- Clarity: Is the response easy to understand?
- Completeness: Does it fully answer the question?
Provide a score and brief reasoning.
"""

[eval.llm.output_schema]
score = { type = "int", min = 1, max = 10 }
reasoning = { type = "str" }

[[eval.cases]]
prompt = "I need help with my account login"
score = { min = 7 }

[[eval.cases]]
prompt = "What are your business hours?"
score = { min = 8 }

[[eval.cases]]
prompt = "How do I cancel my subscription?"
score = { min = 7, max = 10 }
```
### Multi-Criteria Evaluation

```toml
[eval]
description = "Multi-dimensional quality assessment"
type = "llm"
targets.agents = ["technical-support"]
targets.tools = []

[eval.llm]
model = "gpt-4"
prompt = """
Evaluate this technical support response across multiple dimensions:
1. Technical Accuracy (1-10): Is the technical information correct?
2. Clarity (1-10): Is the explanation clear and easy to follow?
3. Completeness (1-10): Does it address all aspects of the question?
4. Actionability (1-10): Are the steps concrete and actionable?
5. Professional Tone (1-10): Is the tone appropriate and helpful?
Provide scores for each dimension and an overall assessment.
"""

[eval.llm.output_schema]
technical_accuracy = { type = "int", min = 1, max = 10 }
clarity = { type = "int", min = 1, max = 10 }
completeness = { type = "int", min = 1, max = 10 }
actionability = { type = "int", min = 1, max = 10 }
professional_tone = { type = "int", min = 1, max = 10 }
overall_score = { type = "int", min = 1, max = 10 }
feedback = { type = "str" }

[[eval.cases]]
prompt = "My application keeps crashing when I try to upload files. What should I do?"
score = { min = 7 }
criteria = ["technical_accuracy", "actionability", "clarity"]

[[eval.cases]]
prompt = "I'm getting a 500 error on your API. Can you help me debug this?"
score = { min = 8 }
criteria = ["technical_accuracy", "completeness"]
```
### Creative Content Evaluation

```toml
[eval]
description = "Evaluate creative writing quality"
type = "llm"
targets.agents = ["creative-writer"]
targets.tools = []

[eval.llm]
model = "claude-3-opus"
prompt = """
Evaluate this creative content on:
- Creativity and originality (1-10)
- Engagement and readability (1-10)
- Relevance to the prompt (1-10)
- Overall quality (1-10)
Consider the target audience and purpose when scoring.
"""

[eval.llm.output_schema]
creativity = { type = "int", min = 1, max = 10 }
engagement = { type = "int", min = 1, max = 10 }
relevance = { type = "int", min = 1, max = 10 }
overall = { type = "int", min = 1, max = 10 }
commentary = { type = "str" }

[[eval.cases]]
prompt = "Write a compelling product description for a smart home device"
score = { min = 6 }

[[eval.cases]]
prompt = "Create an engaging social media post about our company culture"
score = { min = 7 }

[[eval.cases]]
prompt = "Draft a creative email subject line for our newsletter"
score = { min = 8 }
```
### Educational Content Assessment

```toml
[eval]
description = "Evaluate educational explanations"
type = "llm"
targets.agents = ["tutor-bot"]
targets.tools = []

[eval.llm]
model = "gpt-4"
prompt = """
Assess this educational explanation for:
- Accuracy of information (1-10)
- Clarity of explanation (1-10)
- Appropriate level for audience (1-10)
- Use of examples and analogies (1-10)
- Engagement and motivation (1-10)
Consider pedagogical effectiveness in your evaluation.
"""

[[eval.cases]]
prompt = "Explain how photosynthesis works to a 5th grader"
score = { min = 8 }
criteria = ["clarity", "appropriate_level", "examples"]

[[eval.cases]]
prompt = "What is quantum mechanics?"
score = { min = 6, max = 9 }
criteria = ["accuracy", "clarity"]

[[eval.cases]]
prompt = "How do you solve quadratic equations?"
score = { equal = 10 }  # Math explanations should be perfect
criteria = ["accuracy", "examples"]
```
### Comparative Evaluation

```toml
[eval]
description = "Compare responses against reference answers"
type = "llm"
targets.agents = ["knowledge-expert"]
targets.tools = []

[eval.llm]
model = "gpt-4"
prompt = """
Compare the agent's response to the reference answer:
- How well does it match the reference quality? (1-10)
- Does it provide equivalent or better information? (1-10)
- Is it more or less helpful than the reference? (1-10)
Score based on relative quality, not exact matching.
"""

[[eval.cases]]
prompt = "What is the company's return policy?"
reference_response = "We offer a 30-day return policy for all items in original condition with receipt. Returns can be processed in-store or by mail. Refunds are issued to the original payment method within 5-7 business days."
score = { min = 8 }

[[eval.cases]]
prompt = "How do I reset my password?"
reference_response = "To reset your password: 1) Go to the login page, 2) Click 'Forgot Password', 3) Enter your email address, 4) Check your email for a reset link, 5) Follow the link and create a new password. Contact support if you don't receive the email within 10 minutes."
score = { min = 9 }
```
## Advanced Configuration

### Custom Scoring Rubrics

```toml
[eval.llm]
model = "gpt-4"
prompt = """
Use this scoring rubric:
EXCELLENT (9-10): Response fully addresses the question with accurate, comprehensive information. Clear, professional tone. Actionable guidance provided.
GOOD (7-8): Response addresses most aspects of the question with generally accurate information. Minor gaps in completeness or clarity.
SATISFACTORY (5-6): Response addresses the basic question but may lack detail or have minor inaccuracies. Adequate but not exceptional.
NEEDS IMPROVEMENT (3-4): Response partially addresses the question but has significant gaps or inaccuracies. Unclear or unprofessional tone.
POOR (1-2): Response fails to address the question adequately. Major inaccuracies or inappropriate tone.
Provide a score and specific feedback based on this rubric.
"""
```
### Domain-Specific Evaluation

```toml
[eval]
description = "Medical information quality assessment"
type = "llm"
targets.agents = ["health-info-bot"]
targets.tools = []

[eval.llm]
model = "gpt-4"
prompt = """
Evaluate this health information response as a medical professional would:
CRITICAL FACTORS:
- Medical accuracy and evidence-based information
- Appropriate disclaimers about seeking professional medical advice
- Avoidance of specific diagnoses or treatment recommendations
- Clear, accessible language for the general public
- Appropriate level of detail without overwhelming
Rate 1-10 and provide a detailed medical accuracy assessment.
"""

[[eval.cases]]
prompt = "What are the symptoms of diabetes?"
score = { min = 8 }

[[eval.cases]]
prompt = "Should I take antibiotics for my cold?"
score = { min = 9 }  # Should clearly advise against inappropriate antibiotic use
```
### Multi-Model Consensus

```toml
# Use multiple LLMs for consensus evaluation
[eval]
description = "Multi-model consensus evaluation"
type = "llm"

# Primary evaluation with GPT-4
[eval.llm]
model = "gpt-4"
prompt = "Evaluate this response for overall quality (1-10)..."

# Additional evaluation configurations can be added
# to compare scores across different judge models
```
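The fragment above stops before the second judge. One way to realize consensus is a companion configuration that mirrors the primary one but swaps the judge model; this is a sketch that assumes one `[eval]` per TOML file and offline comparison of the resulting scores, neither of which is documented behavior.

```toml
# Companion evaluation with a second judge (hypothetical file name: consensus-claude.toml)
[eval]
description = "Multi-model consensus evaluation (Claude judge)"
type = "llm"
targets.agents = ["*"]
targets.tools = []

[eval.llm]
model = "claude-3-opus"
temperature = 0.0  # keep judging as repeatable as possible
prompt = "Evaluate this response for overall quality (1-10)..."

[eval.llm.output_schema]
score = { type = "int", min = 1, max = 10 }
reasoning = { type = "str" }

# Use the same cases as the GPT-4 evaluation so scores are directly comparable
[[eval.cases]]
prompt = "I need help with my account login"
score = { min = 7 }
```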
## Best Practices

### Prompt Engineering for Judges

**Be specific about criteria:**

```toml
prompt = """
Evaluate on these specific criteria:
1. Factual accuracy - Is the information correct?
2. Completeness - Does it answer all parts of the question?
3. Clarity - Is it easy to understand?
4. Helpfulness - Would this solve the user's problem?
"""
```

**Include context and examples:**

```toml
prompt = """
You are evaluating customer support responses.
Good responses: address the issue, provide clear steps, maintain professional tone.
Poor responses: are vague, lack actionable advice, or are unprofessional.
Evaluate this response (1-10): ...
"""
```

**Use structured output:**

```toml
[eval.llm.output_schema]
score = { type = "int", min = 1, max = 10 }
strengths = { type = "list[str]" }
weaknesses = { type = "list[str]" }
suggestions = { type = "str" }
```
### Score Threshold Selection

```toml
# Conservative thresholds for critical content
score = { min = 9 }  # Medical advice, legal information

# Standard thresholds for general content
score = { min = 7 }  # Customer support, general Q&A

# Flexible thresholds for creative content
score = { min = 5, max = 10 }  # Creative writing, brainstorming
```
### Model Selection Strategy

```toml
# Use GPT-4 for highest quality evaluation
model = "gpt-4"  # Best for complex, nuanced evaluation

# Use GPT-4-turbo for balanced quality and speed
model = "gpt-4-turbo"  # Good compromise for most use cases

# Use GPT-3.5-turbo for cost-effective evaluation
model = "gpt-3.5-turbo"  # Suitable for simple quality checks
```
## Advanced Analysis

### Score Distribution Analysis
LLM evaluations provide rich analytics:
- Score distributions - Understanding quality patterns
- Criteria breakdown - Performance across different dimensions
- Trend analysis - Quality improvements or degradations over time
- Comparative analysis - Performance across different agents or versions
### Inter-Rater Reliability

```toml
# Use multiple judge models for reliability assessment
[eval]
description = "Multi-judge reliability test"
# Compare scores from different models to ensure consistency
```
## Troubleshooting

### Inconsistent Scores

```toml
# If LLM scores are inconsistent:

# 1. Lower temperature for more consistent judging
[eval.llm]
temperature = 0.0  # Most deterministic

# 2. Use more specific evaluation criteria
prompt = "Evaluate ONLY the factual accuracy of this response (1-10)..."

# 3. Add reference examples to the prompt
prompt = """
Score this response compared to these examples:
Excellent (10): [example of perfect response]
Good (7): [example of good response]
Poor (3): [example of poor response]
"""
```
### Score Inflation/Deflation

```toml
# If scores are consistently too high or low:

# 1. Calibrate with reference responses
[[eval.cases]]
prompt = "Test question"
reference_response = "Known high-quality response"
score = { equal = 9 }  # Calibrate judge expectations

# 2. Adjust scoring prompts
prompt = "Be critical in your evaluation. Score harshly for any deficiencies..."

# 3. Use comparative scoring
prompt = "Rate this response relative to typical customer service quality..."
```
### Judge Model Bias

```toml
# Address potential biases in judge models:

# 1. Use multiple judge models for consensus
model = "gpt-4"  # Primary judge
# Compare with claude-3-opus, gpt-3.5-turbo

# 2. Test with known good/bad examples
[[eval.cases]]
prompt = "Known poor response example"
score = { max = 3 }  # Should score low

[[eval.cases]]
prompt = "Known excellent response example"
score = { min = 9 }  # Should score high
```
## Production Considerations

### Cost Management

LLM evaluations can be expensive:

```toml
# Use cost-effective models for high-volume testing
model = "gpt-3.5-turbo"  # Lower cost for bulk evaluation

# Reserve premium models for critical evaluations
model = "gpt-4"  # High-stakes content only

# Optimize prompt length
prompt = "Concise evaluation prompt..."  # Shorter prompts reduce costs
```
### Evaluation Frequency

```toml
# Balance thoroughness with cost:
# Critical evaluations: every deployment
# Standard evaluations: weekly/monthly
# Experimental evaluations: as needed
```
### Integration with Human Review

```toml
# LLM evaluations complement human review:

# Use LLM for initial screening
score = { min = 6 }  # Flag low-quality responses

# Human review for edge cases
score = { min = 4, max = 6 }  # Borderline cases need human judgment

# LLM for scale, human for quality
# High-volume: LLM evaluation
# High-stakes: human review
```
## Next Steps
- Configuration Overview - Complete TOML reference
- Accuracy Evaluations - Complement with exact matching