The Problem

Agents are non-deterministic. The same prompt produces different phrasings on different runs. Exact string matching, the traditional approach in software testing, doesn't work for agent output:

expected = "Card charged $20"
actual = "We billed the card for twenty dollars."
assert expected == actual  # ???

This was obviously never going to work.

The Solution

The actual requirement: pass when the semantic similarity between expected and actual output clears a meaningful threshold.

import embedsim

score = embedsim.pairsim(expected, actual)  # 0.89
assert score >= 0.8  # True

Design Constraints

We needed:

  • Outlier detection: Flag semantically divergent items in collections
  • Deterministic scoring: Same inputs → same score
  • Intuitive thresholds: Scores that map to levels of similarity you can reason about
  • Easily swappable model back-ends: Experiment with different embedding models
  • Online and offline model support: Quick integration of API-based and local models
  • No database: Cold-start fast, CI-friendly
  • Straightforward, elegant API: Easy to understand and maintain

Implementation Choices

  • Cosine over Euclidean: Scale-invariant similarity
  • Dual-mode architecture: OpenAI API or local sentence-transformers
  • MAD for outliers: Median absolute deviation guards against drift
  • Compact and maintainable: Minimal surface area, maximal test coverage

Mathematical Approach

Embedding and Similarity

Each text is converted to a high-dimensional vector (embedding) that captures semantic meaning. Similarity is measured as the cosine of the angle between two vectors:

similarity = (A · B) / (||A|| × ||B||)

In principle cosine similarity ranges from -1 to 1; in practice, embedding similarities fall between roughly 0 (orthogonal, unrelated) and 1 (parallel, semantically identical).
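
A direct translation of the formula, using plain NumPy for illustration (this is a sketch, not embedsim's internal code):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # similarity = (A · B) / (||A|| × ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))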

Why Cosine Over Euclidean

We started with Euclidean distance. It broke immediately: longer texts scored as less similar even when semantically identical. The problem was magnitude: embeddings of longer texts can have larger norms, so Euclidean distances grow regardless of semantic direction.

Cosine similarity measures the angle between vectors, not their length. Two vectors pointing in the same direction score identically even if one is twice as long as the other. "The cat sat" and "The cat sat on the mat" point in similar semantic directions even though one is longer. Direction matters more than magnitude for semantic similarity.
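
A toy demonstration of the scale invariance, again in plain NumPy with made-up vectors: doubling a vector leaves the cosine score untouched while the Euclidean distance grows.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2 * a  # same direction, twice the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(cosine)     # 1.0   -- identical direction
print(euclidean)  # ~3.74 -- penalized purely for magnitude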

Outlier Detection

For group coherence, we compute the centroid (mean vector) of all embeddings, then measure each text's cosine similarity to the centroid. Outliers are texts whose similarity falls well below the rest of the group, identified using Median Absolute Deviation (MAD):

MAD = median(|x_i - median(x)|)
outlier_threshold = median(x) - k × MAD

where x is the vector of each text's similarity to the centroid; anything below the threshold is flagged as an outlier.

We initially used standard deviation for thresholds. That failed in an obvious-in-retrospect way: standard deviation is pulled by the outliers we're trying to detect. MAD is robust because it uses the median—the outliers don't distort the calculation.
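
A minimal sketch of the technique, assuming embeddings arrive as a NumPy matrix with one row per text (this is not embedsim's actual internals, and k is a tunable constant):

import numpy as np

def mad_outliers(embeddings: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag rows whose similarity to the centroid falls below median - k * MAD."""
    centroid = embeddings.mean(axis=0)
    # cosine similarity of each embedding to the group centroid
    sims = embeddings @ centroid / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    )
    median = np.median(sims)
    mad = np.median(np.abs(sims - median))
    # boolean mask: True marks an outlier
    return sims < median - k * mad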

Thresholds Should Be Intuitive

Thresholds are meant to be interpreted. A well-behaved model should map scores to intuition roughly along these lines (see the sketch after the list):

  • 0.9: Near-duplicate detection
  • 0.8: Semantic equivalence (paraphrasing)
  • 0.7: Topical similarity
  • Below 0.7: Weak or no semantic relationship
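
In practice this means picking the assertion threshold to match the relationship you care about. The texts below reuse the earlier example; the exact score will vary by model:

import embedsim

score = embedsim.pairsim(
    "Card charged $20",
    "We billed the card for twenty dollars."
)

# 0.8 asserts paraphrase-level equivalence; use 0.9 for near-duplicates
assert score >= 0.8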

Model Selection

We assumed larger models would perform better across the board. Benchmarking 6 embedding models across 11 test scenarios revealed that wasn't true: coherent texts, mixed topics, outlier detection, and length scaling each favor different characteristics.

The highest coherence scores came from intfloat/e5-large-v2 (0.4983 average), but it failed 4 out of 11 mixed topic tests. The perfect 11/11 pass rate went to openai/text-embedding-3-small, despite lower raw scores. Different models optimize for different tradeoffs.

For API-based Processing

openai/text-embedding-3-small achieved a perfect 11/11 pass rate and excels at distinguishing mixed topics. Best topic discrimination at $0.02 per million tokens.

openai/text-embedding-3-large delivers higher accuracy with 3,072 dimensions (vs 1,536 for 3-small): 54.9% on MIRACL vs 31.4% for ada-002. Premium cost at $0.13 per million tokens.

For Local Processing

BAAI/bge-large-en-v1.5 posted the best local pass rate at 8/11 with consistent performance across diverse content. Most reliable baseline for general-purpose use with an open-weights model.

intfloat/e5-large-v2 achieved the highest coherence scores (avg 0.4983) with lowest variance (0.0114). Best when texts should score high similarity, but struggles with mixed topics (failed 4/11 tests).

sentence-transformers/all-MiniLM-L6-v2 runs fast at 384 dimensions for rapid prototyping. Higher variance (0.0724) and 256 token limit make it less suitable for production.

jinaai/jina-embeddings-v2-base-en supports 8,192 token context for long documents. Lower coherence scores (0.3750) and outlier-detection issues, likely because our benchmarks focus on short-form content.

Full Results

Complete benchmark data including all test scenarios, pass/fail analysis, and performance metrics: MODELS.md

Benchmark Data

We started with 45 text samples organized into 9 categories (3 sizes × 3 content types: coherent, mixed, outlier). The benchmark run covered:

  • 6 embedding models benchmarked
  • 11 test scenarios covering coherence detection, topic distinction, and outlier identification
  • 66 test executions (11 scenarios × 6 models)
  • 270+ individual embeddings computed and evaluated

This is a baseline start, not comprehensive coverage. The benchmarks show expected performance under these specific conditions—short-form English technical text. Different domains, languages, or content types will likely behave differently.

We're still learning what makes a model perform well. Full benchmark data and methodology: MODELS.md

Usage Patterns

Comparing Two Strings

import embedsim

score = embedsim.pairsim(
    "The cat sat on the mat",
    "A feline rested on the rug"
)  # 0.89

Comparing a Group of Strings

import embedsim

texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Ruby on Rails is a web framework",
    "The weather is nice today"  # outlier
]

scores = embedsim.groupsim(texts)

Configurable Models

import embedsim

embedsim.config.model = "openai/text-embedding-3-large"
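
Switching to a local model should look the same; the line below assumes local sentence-transformers models are referenced through the same config field by their Hugging Face names, as in the benchmarks above:

# local sentence-transformers model (assumed to use the same config field)
embedsim.config.model = "BAAI/bge-large-en-v1.5"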

Prior Art

  • sentence-transformers: Made local embedding models accessible and practical
  • Information retrieval cosine baseline: Decades of research on semantic similarity
  • Small libraries philosophy: Do one thing, do it well; shareable and composable

Contributing

Repository: github.com/Agent-CI/embedsim

If you find a case where semantic similarity ≥ threshold still fails, we want to know. Open an issue with the two texts, your expected outcome, and the actual score. We'll add it to the benchmark suite.

If this solves a problem for you, or if you're using it in an interesting way, share what you're building. The more use cases we understand, the better we can test edge cases and guide model selection.

Why This Exists

Agent CI needs semantic assertions for testing agent outputs. Vector databases add latency and complexity for simple comparisons. We needed to measure semantic similarity with a function call.

The benchmarking process taught us more than expected. Highest coherence scores don't correlate with best overall reliability. Topic distinction requires different characteristics than coherence detection. Model selection depends on your use case—there's no universal "best."

Building this and running systematic benchmarks exposed performance boundaries that model documentation doesn't reveal. We learned when models fail, not just when they succeed. That understanding shapes how we use them. The benchmark infrastructure makes it straightforward to keep learning: add new models, test new scenarios, validate changes without breaking what works.