Automatically detects and evaluates LangChain agents and tools with built-in understanding of LangChain patterns.
RAG-focused evaluation patterns and document retrieval validation for LlamaIndex applications.
Native support for Pydantic AI agents with type-safe evaluation execution and validation.
Native OpenAI agent evaluation with built-in understanding of OpenAI's agent conventions.
Google agent framework support with cloud integration capabilities for Gemini models.
Multi-model agent framework support for OpenAI and Anthropic models.
All evaluations run server-side for consistency and scalability, with the CLI acting as a coordination layer.
Evaluation configurations stored in a `.agentplatform/` folder within your repository, version-controlled alongside your code.
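As a rough illustration (the file names here are hypothetical, not a required layout), a checked-in `.agentplatform/` folder might look like:

```
.agentplatform/
├── config.toml                 # project-level settings
├── agents/
│   └── user_assistant.toml     # agent-level evaluations
└── tools/
    └── search.toml             # individual tool evaluations
```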
Native GitHub integration with repository selection, branch tracking, and commit monitoring.
Two-tier evaluation system focusing on agent-level behavior and individual tool validation.
Objective scoring of agent response quality, coherence, and helpfulness with threshold visualization.
Semantic drift detection in agent responses with baseline comparison and similarity scoring.
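The underlying idea can be sketched in a few lines; `embed` stands in for whatever embedding model is used, and the 0.85 threshold is illustrative rather than a platform default:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def has_drifted(baseline_response: str, current_response: str,
                embed, threshold: float = 0.85) -> bool:
    """Flag drift when the current response is semantically far from the baseline."""
    similarity = cosine_similarity(embed(baseline_response), embed(current_response))
    return similarity < threshold
```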
Automated validation of agent return values against expected formats and data types.
Schema validation for structured agent outputs with detailed error reporting and compliance checking.
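Given the Pydantic support noted above, a schema check of this kind might look roughly like the following sketch (the `OrderResult` schema is a made-up example, and Pydantic v2 is assumed):

```python
from pydantic import BaseModel, ValidationError

class OrderResult(BaseModel):
    """Expected shape of the agent's structured output (illustrative schema)."""
    order_id: str
    total: float
    items: list[str]

def validate_agent_output(raw_output: dict) -> list[str]:
    """Return human-readable schema violations (empty list if compliant)."""
    try:
        OrderResult.model_validate(raw_output)
        return []
    except ValidationError as exc:
        return [f"{'.'.join(map(str, e['loc']))}: {e['msg']}" for e in exc.errors()]
```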
Multi-run consistency analysis to ensure agent behavior stability across multiple executions.
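One simple way to quantify run-to-run stability, shown here as a sketch rather than the platform's actual algorithm, is to score each run and examine the spread:

```python
from statistics import mean, stdev

def consistency_report(run_scores: list[float], max_stdev: float = 0.05) -> dict:
    """Summarize score stability across repeated runs of the same agent."""
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return {
        "mean_score": mean(run_scores),
        "stdev": spread,
        "stable": spread <= max_stdev,  # threshold is illustrative
    }

# Example: five runs of the same prompt
print(consistency_report([0.91, 0.93, 0.90, 0.92, 0.94]))
```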
Main dashboard with system health metrics, real-time activity stream, and sparkline trend visualization.
Environment overview with production, staging, feature branch, and PR environment tracking including agent/prompt counts per environment.
Version-aware agent viewing with commit hash support and performance tracking across deployments.
Prompt detail views with complete version history and commit hash navigation for tracking prompt evolution.
Side-by-side comparison of prompt versions with syntax highlighting for identifying changes between versions.
Tool detail views with version tracking and usage analytics for development tool integration.
Score visualization with thresholds, historical tracking, and detailed result analysis interfaces.
Agent run monitoring with detailed execution logs, configuration metadata, and performance analysis.
Advanced filtering capabilities by status, branch, date range, and performance metrics for run analysis.
PR-specific environment creation and tracking with evaluation results linked to pull request workflows.
Multi-application support with environment-aware switching and status indicators for active, maintenance, and inactive applications.
User permission system with reviewer, developer, and administrator roles for team collaboration.
Framework for defining and running project-specific evaluation criteria beyond built-in tests.
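The extension API itself isn't documented here, but a project-specific criterion can be as small as a scoring function; in this standalone sketch the competitor list and the 0-to-1 scale are invented for illustration:

```python
def no_competitor_mentions(response: str) -> float:
    """Project-specific criterion: penalize responses that mention competitors.

    Returns a score in [0, 1]. How this plugs into the platform's custom-eval
    framework is not specified here, so treat it as a standalone sketch.
    """
    competitors = {"acme corp", "globex"}  # hypothetical list
    hits = sum(name in response.lower() for name in competitors)
    return 0.0 if hits else 1.0
```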
User feedback integration system for collecting and analyzing manual evaluation results.
Comprehensive logging and monitoring of user actions across the platform for audit and analytics.
Flexible system for creating and managing environment variable collections, with smart default assignment to Git-based environments and support for cloning collections.
Lightweight command-line tool for running evaluations, authenticating, and viewing results from your terminal.
Enhanced automated evaluation comments on GitHub PRs with deployment blocking and detailed regression analysis. Automatically prevent merges when evaluations fail.
Built-in prompt injection and SQL injection testing to validate agent security without additional configuration. Protect against common vulnerabilities automatically.
Compare agent behavior and performance between development, staging, and production environments. Identify environment-specific issues before deployment.
Evaluate uncommitted development code using temporary Git refs without requiring formal commits. Test work-in-progress without cluttering commit history.
Real-time tracking of agent performance, token usage, and cost metrics across deployments. Understand your AI spending at a glance.
Block deployments when evaluations detect regressions in existing agent functionality. Never accidentally deploy a degraded agent again.
Web-based insights and trend tracking for stakeholders and non-technical team members. Share progress with product managers effortlessly.
Compare different agent versions and configurations with statistical significance testing. Make data-driven decisions about agent improvements.
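A two-sample t-test is one common way to judge whether a score difference between two versions is real or noise; this sketch uses SciPy and an illustrative significance level, not necessarily the platform's method:

```python
from scipy.stats import ttest_ind

def versions_differ(scores_a: list[float], scores_b: list[float],
                    alpha: float = 0.05) -> bool:
    """True if mean evaluation scores of two agent versions differ significantly."""
    result = ttest_ind(scores_a, scores_b, equal_var=False)  # Welch's t-test
    return result.pvalue < alpha
```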
Compare current agent performance against specific Git commits or production branches. Track improvements over time quantitatively.
Cross-platform agent support for Microsoft's Semantic Kernel framework. Evaluate .NET and Python agents seamlessly.
Multi-agent evaluation coordination for CrewAI team-based agent workflows. Test agent collaboration and delegation patterns.
Azure-integrated evaluation for Microsoft's cloud-native agent platform. Leverage Azure infrastructure for scalable testing.
Shared evaluation standards and team coordination tools for larger development organizations. Scale agent development across departments.
Automated security and performance compliance reports for enterprise and regulated environments. Meet audit requirements automatically.
Run specific evaluations using double-colon syntax (e.g., `agents::user_assistant::security`). Target exactly what you want to test.
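For example, assuming a hypothetical `agentplatform` binary and `run` subcommand (the actual command may differ):

```
# Run only the security evaluation for the user_assistant agent (hypothetical CLI name)
agentplatform run agents::user_assistant::security
```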
Single sign-on support for large organizations with existing identity management systems. Seamless authentication for enterprise teams.
On-premises deployment capabilities for organizations with strict data governance requirements. Keep sensitive data within your infrastructure.
Track the percentage of context used in live runs and evaluation runs to optimize agent memory usage and prevent context overflow issues.
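The metric itself is simple to compute; the window size below is illustrative:

```python
def context_usage_pct(prompt_tokens: int, completion_tokens: int,
                      context_window: int) -> float:
    """Percentage of the model's context window consumed by one run."""
    return 100.0 * (prompt_tokens + completion_tokens) / context_window

# e.g. a run that used 6,000 prompt + 1,200 completion tokens in a 128k window
print(round(context_usage_pct(6_000, 1_200, 128_000), 1))  # -> 5.6
```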
Comprehensive audit logging for every agent interaction with selective evaluation capabilities for compliance and debugging purposes.
Automated detection of potential data leaks between users, including jailbreak attempts that could expose unauthorized user information.
Out-of-the-box evaluation for detecting expletives, hate speech, and off-topic jailbreak attempts without requiring unsavory content in your repository.
Custom agentic system that analyzes existing codebases and generates initial evaluation configurations for one-click project setup.
Define evaluation configurations using pure Python instead of TOML for more powerful and dynamic configuration. Enables programmatic eval generation, conditional logic, and better IDE support for Python-first teams.
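The exact Python configuration API isn't specified here, so the following is only an indicative sketch of what pure-Python configuration could enable; every name in it is hypothetical:

```python
# Hypothetical Python-based evaluation config; names and structure are illustrative.
from dataclasses import dataclass, field

@dataclass
class Evaluation:
    name: str
    target: str            # e.g. "agents::user_assistant"
    threshold: float = 0.8

@dataclass
class EvalConfig:
    evaluations: list[Evaluation] = field(default_factory=list)

# Programmatic generation: one security eval per agent, something static TOML can't express.
AGENTS = ["user_assistant", "billing_agent"]
config = EvalConfig(
    evaluations=[Evaluation(name=f"{a}-security", target=f"agents::{a}") for a in AGENTS]
)
```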
Export shareable performance reports for agents with evaluation results, metrics, and trends for stakeholder presentations and documentation.