Let's Start With Straightforward Agent Code

Let's take this example of a bank support agent built with Pydantic AI:

from pydantic_ai import Agent, RunContext

class BankSupportDependencies:
    def __init__(self, customer_id: str, db_client):
        self.customer_id = customer_id
        self.db = db_client

async def customer_balance(
    ctx: RunContext[BankSupportDependencies],
    include_pending: bool
) -> float:
    """Returns the customer's current account balance."""
    return await ctx.deps.db.customer_balance(
        id=ctx.deps.customer_id,
        include_pending=include_pending
    )

support_agent = Agent(
    'openai:gpt-4',
    deps_type=BankSupportDependencies,
    tools=[customer_balance],
    instructions='You are a support agent in our bank...'
)

The business logic is clear, the structure is readable, and the whole thing is easy to understand. This is exactly what agent code should look like.

Now Let's Add "Enterprise-Grade" Observability

But wait - we need monitoring, evaluation, and prompt tracking for production, right? Let's see what happens when we integrate with an industry-leading observability platform:

+ # Environment variables you now need to manage:
+ # PHOENIX_COLLECTOR_ENDPOINT="http://localhost:6006/v1/traces"
+ # PHOENIX_PROJECT_NAME="your-project"
+ # PHOENIX_API_KEY="your-api-key"
+ # ARIZE_SPACE_ID="your-space-id"
+ # ARIZE_API_KEY="your-arize-key"
+
+ from phoenix.otel import register
+ from arize.otel import register as arize_register
+ from openinference.instrumentation.openai import OpenAIInstrumentor
+ import phoenix as px
+ from phoenix.client.types import PromptVersion
+
  from pydantic_ai import Agent, RunContext

+ # Instrumentation setup
+ tracer_provider = register(
+     auto_instrument=True,
+     batch=True,
+     project_name="bank-support"
+ )
+
+ arize_register(
+     space_id="your-space-id",
+     api_key="your-api-key",
+     project_name="bank-support"
+ )
+
+ OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
+
+ # External prompt management
+ bank_prompt = px.Client().prompts.create(
+     name="bank-support-agent",
+     version=PromptVersion(
+         [{"role": "system", "content": "You are a support agent in our bank..."}],
+         model_name="gpt-4",
+     ),
+ )
+
+ prompt_version = px.Client().prompts.get(name="bank-support-agent")
+
  class BankSupportDependencies:
      def __init__(self, customer_id: str, db_client):
          self.customer_id = customer_id
          self.db = db_client

  async def customer_balance(
      ctx: RunContext[BankSupportDependencies],
      include_pending: bool
  ) -> float:
      """Returns the customer's current account balance."""
      return await ctx.deps.db.customer_balance(
          id=ctx.deps.customer_id,
          include_pending=include_pending
      )

  support_agent = Agent(
      'openai:gpt-4',
      deps_type=BankSupportDependencies,
      tools=[customer_balance],
-     instructions='You are a support agent in our bank...'
+     instructions=prompt_version.messages[0]["content"]
  )

Look what happened to our straightforward code:

  • 25+ new lines of instrumentation setup
  • 5 new imports just for monitoring
  • Prompt management now requires external API calls
  • Environment variables to keep in sync across every deployment
  • Your agent logic buried under infrastructure concerns

I Miss The Simple Version

You know what? I kind of miss the simplicity of that original implementation. The clean separation of concerns. The focus on what the agent actually does instead of how it's monitored.

Well, here's the thing: Agent CI lets you keep that original implementation, completely unmodified, and still get the monitoring and evaluation you need.

The Hidden Cost of Instrumentation

Look at the difference: 30+ lines of setup, environment variables, and prompt management code before you even start building your agent. In the example above, the instrumentation and prompt plumbing now take up more lines than the agent logic itself.

This is simply not as elegant or maintainable. We've taken clean, focused code and buried it under layers of monitoring infrastructure. The simplicity of that original agent implementation is lost. What was once easy to read, understand, and modify has become a complex system where your actual business logic is scattered between your agent code and external platform APIs.

What We Forgot: Separation of Concerns

In regular software development, we learned this decades ago:

  • Tests don't live in production code
  • Logging is configurable, not embedded (a quick sketch follows this list)
  • Monitoring is infrastructure, not application logic
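
Take the logging principle: module code only emits events, while handlers, levels, and destinations are configured once at the application's entry point. Here's a minimal sketch using Python's standard logging module (the file and function names are just illustrative):

app/orders.py - Business logic only emits events:

import logging

logger = logging.getLogger(__name__)

def process_order(order_id: str) -> None:
    # The module never decides where log output goes or how it's formatted.
    logger.info("Processing order %s", order_id)

main.py - Logging configured once at the entry point:

import logging
from app.orders import process_order

logging.basicConfig(level=logging.INFO)  # swap in a JSON handler here without touching app code
process_order("A-1001")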

Why did we abandon these principles for agents?

Imagine if we tested regular functions like this:

def calculate_total(items):
    test_harness.start_test("calculate_total")
    total = 0
    for item in items:
        test_harness.track_iteration(item)
        total += item.price
    test_harness.end_test(total)
    return total

Nobody does this. We write clean functions and test them separately:

app/shopping_cart.py - Clean business logic:

def calculate_total(items):
    return sum(item.price for item in items)

tests/test_shopping_cart.py - Simple tests:

def test_calculate_total():
    items = [Item(price=10), Item(price=20)]
    assert calculate_total(items) == 30

Agent evaluation should work the same way.

How Agent CI Preserves Your Straightforward Code

Remember that original, clean implementation? Agent CI lets you keep it exactly as-is while providing all the observability you need.

Start with your straightforward code (unchanged):

from pydantic_ai import Agent, RunContext

class BankSupportDependencies:
    def __init__(self, customer_id: str, db_client):
        self.customer_id = customer_id
        self.db = db_client

async def customer_balance(
    ctx: RunContext[BankSupportDependencies],
    include_pending: bool
) -> float:
    """Returns the customer's current account balance."""
    return await ctx.deps.db.customer_balance(
        id=ctx.deps.customer_id,
        include_pending=include_pending
    )

support_agent = Agent(
    'openai:gpt-4',
    deps_type=BankSupportDependencies,
    tools=[customer_balance],
    instructions='You are a support agent in our bank...'
)

Agent CI automatically discovers everything it needs from your existing code:

Discovery    | Value                        | Source
Agent name   | support_agent                | Variable name
Model        | openai:gpt-4                 | Agent constructor
Tools        | customer_balance             | Tools array
Dependencies | BankSupportDependencies      | Type annotation
Instructions | "You are a support agent..." | Instructions parameter
Version      | Git commit hash              | Git repository
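
Conceptually, this kind of discovery is just static analysis of your module. The sketch below is not Agent CI's actual implementation - it's only an illustration of the idea, using Python's standard ast module to find Agent(...) assignments and pull out the model, tools, dependency type, and instructions:

import ast

def discover_agents(source: str) -> list[dict]:
    """Illustrative only: find name = Agent(...) assignments and extract their metadata."""
    agents = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        if not (isinstance(call.func, ast.Name) and call.func.id == "Agent"):
            continue
        if not isinstance(node.targets[0], ast.Name):
            continue
        info = {"name": node.targets[0].id}
        if call.args and isinstance(call.args[0], ast.Constant):
            info["model"] = call.args[0].value
        for kw in call.keywords:
            if kw.arg == "instructions" and isinstance(kw.value, ast.Constant):
                info["instructions"] = kw.value.value
            elif kw.arg == "tools" and isinstance(kw.value, ast.List):
                info["tools"] = [ast.unparse(item) for item in kw.value.elts]
            elif kw.arg == "deps_type":
                info["deps_type"] = ast.unparse(kw.value)
        agents.append(info)
    return agents

Run against the file above, that returns the agent name, model, tool list, dependency type, and instructions - everything in the table except the version, which comes from the git commit.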

Agent CI Evaluation Format

Instead of instrumenting your code, you define evaluations externally using simple TOML configuration:

.agentci/bank-support-accuracy.toml

[eval]
description = "Bank support agent accuracy tests"
type = "accuracy"
targets.agents = ["support_agent"]

[[eval.cases]]
prompt = "What's my account balance?"
context = { customer_id = "12345", include_pending = true }
output = "Your current balance is {{*}}"

Your evaluations are versioned with your code, but completely separate from your agent implementation.
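
To make the format concrete, here's a rough sketch of what a runner could do with a file like this. It is not Agent CI's actual runner - just an illustration using Python's built-in tomllib (3.11+), with the {{*}} wildcard translated into a regex and the agent invoked through Pydantic AI's normal run API:

import re
import tomllib

def load_cases(path: str) -> list[dict]:
    with open(path, "rb") as f:
        return tomllib.load(f)["eval"]["cases"]

def matches(template: str, actual: str) -> bool:
    # Literal text around each {{*}} must match exactly; the wildcard matches anything.
    pattern = ".*".join(re.escape(part) for part in template.split("{{*}}"))
    return re.fullmatch(pattern, actual, flags=re.DOTALL) is not None

# Hypothetical usage against the support_agent defined earlier
# (make_deps is an imaginary helper that builds BankSupportDependencies from the context table):
for case in load_cases(".agentci/bank-support-accuracy.toml"):
    result = support_agent.run_sync(case["prompt"], deps=make_deps(case["context"]))
    assert matches(case["output"], result.output)  # recent pydantic-ai; older releases expose .data instead

The mechanics matter less than the shape: the TOML describes what to check, and none of it leaks into the agent module.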

The Bottom Line

Your codebase should express what your agents do, not how they're monitored.

Remember that straightforward agent code we started with? That's exactly what your codebase should look like in production.

Agent CI proves you don't have to choose between enterprise-grade observability and clean code. You can have both. Keep your agents clean, focused, and maintainable while getting all the monitoring and evaluation you need.

Stop writing instrumentation. Start writing agents that you actually enjoy maintaining.