Documentation

Agent CI brings continuous integration and automated testing to AI agent development. Treat your agents like the software applications they are, with Git-native versioning, automated evaluations on every pull request, and production monitoring.

New to Agent CI? Start with the Quick Start Guide to connect your repository and run your first evaluation in under 5 minutes.

Ready to build robust agents? Learn about our evaluation types including accuracy testing, performance monitoring, safety validation, and LLM-as-judge assessments.

Using a popular framework? Check out our framework integrations for LangChain, Pydantic AI, LlamaIndex, and more with zero-instrumentation setup.

Getting Started

  • Quick Start Guide
    Get up and running with Agent CI in under 5 minutes. Connect your GitHub repository and set up your first automated evaluation.

Core Concepts

  • Agents
    Understand Agent CI's core philosophy: agents are software applications, not ML models. Learn about multi-prompt architecture, Git-native versioning, and production deployment strategies.

  • Evaluations (Evals)
    Automated tests that validate agent behavior and performance. Learn about accuracy, performance, safety, consistency, and LLM-as-judge evaluation types for comprehensive agent testing.

  • Continuous Integration (CI/CD)
    Apply software engineering best practices to agent development. Learn how Agent CI automates testing, validation, and deployment through Git-based CI/CD pipelines.

Integration

  • Agent Frameworks
    Agent CI supports popular frameworks including LangChain, LangGraph, Pydantic AI, LlamaIndex, and more. Automatic detection and evaluation without code modifications.

  • LangChain
Agent CI automatically discovers and evaluates LangChain agents and tools. Support for LangGraph create_react_agent, @tool decorator, StructuredTool, and BaseTool patterns.
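
    As a flavor of what gets picked up, a minimal sketch of the @tool pattern wired into a LangGraph ReAct agent (the weather function and model id here are illustrative, not required by Agent CI):

```python
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def get_weather(city: str) -> str:
    """Return a short weather summary for the given city."""
    return f"It is always sunny in {city}."

# Build a LangGraph ReAct agent around the tool; the model id is an example.
agent = create_react_agent(model="openai:gpt-4o-mini", tools=[get_weather])
```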

  • LlamaIndex
Agent CI automatically discovers and evaluates LlamaIndex agents and tools. Support for FunctionAgent, ReActAgent, FunctionTool.from_defaults, and plain function patterns.
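
    A minimal sketch of the FunctionTool.from_defaults and FunctionAgent patterns named above (the multiply function and model id are illustrative):

```python
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

# from_defaults infers the tool name and description from the function itself.
multiply_tool = FunctionTool.from_defaults(fn=multiply)

agent = FunctionAgent(
    tools=[multiply_tool],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a calculator assistant.",
)
```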

  • Pydantic AI
Agent CI automatically discovers and evaluates Pydantic AI agents and tools. Support for Agent class with model and system_prompt, plus plain function tools.
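
    A minimal sketch of the Agent-with-model-and-system_prompt pattern, plus a plain-function tool (the die roller and model id are illustrative):

```python
import random

from pydantic_ai import Agent

# Agent takes a model identifier and a system prompt; the model id is an example.
agent = Agent("openai:gpt-4o-mini", system_prompt="Answer concisely.")

# tool_plain registers a plain function (no run context) as a tool.
@agent.tool_plain
def roll_die() -> int:
    """Roll a six-sided die."""
    return random.randint(1, 6)
```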

  • OpenAI Agents SDK
Agent CI automatically discovers and evaluates OpenAI Agents SDK agents and tools. Support for Agent class with @function_tool decorator and plain functions.
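
    A minimal sketch of the Agent plus @function_tool pattern (the clock tool is illustrative):

```python
from datetime import datetime, timezone

from agents import Agent, function_tool

@function_tool
def utc_now() -> str:
    """Return the current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()

agent = Agent(
    name="Clock Assistant",
    instructions="Answer questions about the current time.",
    tools=[utc_now],
)
```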

  • Google ADK
Agent CI automatically discovers and evaluates Google ADK (Agent Development Kit) agents and tools. Support for Agent, LlmAgent, FunctionTool, and plain function patterns.
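
    A minimal sketch of an LlmAgent with a plain-function tool (the lookup function and model id are illustrative):

```python
from google.adk.agents import LlmAgent

def get_capital(country: str) -> str:
    """Return the capital city of a country (illustrative stub)."""
    return {"France": "Paris", "Japan": "Tokyo"}.get(country, "unknown")

# Plain functions are wrapped as tools automatically; FunctionTool works too.
agent = LlmAgent(
    name="capital_agent",
    model="gemini-2.0-flash",
    instruction="Answer questions about capital cities.",
    tools=[get_capital],
)
```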

  • Agno
Agent CI automatically discovers and evaluates Agno agents and tools. Support for Agent class with OpenAI/Anthropic models, Toolkit classes, @tool decorator, and plain functions.
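
    A minimal sketch of an Agno Agent with an OpenAI model and the @tool decorator (the word-count tool and model id are illustrative):

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools import tool

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    tools=[word_count],
    instructions="Use word_count when asked about text length.",
)
```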

  • GitHub Integration
    Native GitHub integration for automated evaluation and monitoring. Install the GitHub App to run evals on pull requests and track agent performance in production.

Evaluations

  • TOML Configuration Overview
    Complete reference for Agent CI's TOML configuration format. Learn how to define evaluations, configure targets, and set up test cases in .agentci/evals/ files.

  • Accuracy Evaluations
    Test agent outputs with exact matching, substring containment, regex patterns, semantic similarity, and schema validation. Essential for deterministic tasks and API responses.

  • Performance Evaluations
    Measure response latency, token usage, and resource consumption with configurable thresholds. Ensure your agents meet production performance and cost requirements.

  • Safety Evaluations
    Validate security against prompt injection, harmful content, SQL injection, PII exposure, and jailbreaking attempts. Built-in templates for common attack vectors.

  • Consistency Evaluations
    Test output variance and behavioral reliability across multiple runs. Ensure deterministic behavior and detect regressions in agent consistency over time.

  • LLM Evaluations
    Use LLM-as-judge methodology for quality assessment. Measure helpfulness, clarity, appropriateness, and completeness with configurable scoring criteria.