Metadata-Version: 2.4
Name: eval-protocol
Version: 0.2.72
Summary: The official Python SDK for Eval Protocol (EP), an open protocol that standardizes how developers author evals for large language model (LLM) applications.
Author-email: Fireworks AI <info@fireworks.ai>
License-Expression: MIT
Project-URL: Homepage, https://github.com/fireworks-ai/eval-protocol
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: dataclasses-json>=0.5.7
Requires-Dist: uvicorn>=0.15.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: openai>=1.78.1
Requires-Dist: aiosqlite
Requires-Dist: aiohttp
Requires-Dist: mcp>=1.9.2
Requires-Dist: PyYAML>=5.0
Requires-Dist: hydra-core>=1.3.2
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: httpx>=0.24.0
Requires-Dist: anthropic>=0.59.0
Requires-Dist: litellm<1.75.0
Requires-Dist: pytest>=6.0.0
Requires-Dist: pytest-asyncio>=0.21.0
Requires-Dist: peewee>=3.18.2
Requires-Dist: backoff>=2.2.0
Requires-Dist: questionary>=2.0.0
Requires-Dist: toml>=0.10.0
Requires-Dist: loguru>=0.6.0
Requires-Dist: docstring-parser>=0.15
Requires-Dist: rich>=12.0.0
Requires-Dist: psutil>=5.8.0
Requires-Dist: addict>=2.4.0
Requires-Dist: deepdiff>=6.0.0
Requires-Dist: websockets>=15.0.1
Requires-Dist: fastapi>=0.116.1
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest-httpserver; extra == "dev"
Requires-Dist: werkzeug>=2.0.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Requires-Dist: transformers>=4.0.0; extra == "dev"
Requires-Dist: pandas>=1.5.0; extra == "dev"
Requires-Dist: types-setuptools; extra == "dev"
Requires-Dist: types-requests; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Requires-Dist: types-docker; extra == "dev"
Requires-Dist: versioneer>=0.20; extra == "dev"
Requires-Dist: openai>=1.78.1; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: e2b; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: docker==7.1.0; extra == "dev"
Requires-Dist: ipykernel>=6.30.0; extra == "dev"
Requires-Dist: jupyter>=1.1.1; extra == "dev"
Requires-Dist: pip>=25.1.1; extra == "dev"
Requires-Dist: haikus==0.3.8; extra == "dev"
Requires-Dist: syrupy>=4.0.0; extra == "dev"
Requires-Dist: gymnasium>=1.2.0; extra == "dev"
Provides-Extra: trl
Requires-Dist: torch>=1.9; extra == "trl"
Requires-Dist: trl>=0.7.0; extra == "trl"
Requires-Dist: peft>=0.7.0; extra == "trl"
Requires-Dist: transformers>=4.0.0; extra == "trl"
Requires-Dist: accelerate>=0.28.0; extra == "trl"
Provides-Extra: openevals
Requires-Dist: openevals>=0.1.0; extra == "openevals"
Provides-Extra: fireworks
Requires-Dist: fireworks-ai>=0.19.19; extra == "fireworks"
Provides-Extra: box2d
Requires-Dist: swig; extra == "box2d"
Requires-Dist: gymnasium[box2d]>=0.29.0; extra == "box2d"
Requires-Dist: Pillow; extra == "box2d"
Provides-Extra: langfuse
Requires-Dist: langfuse>=2.0.0; extra == "langfuse"
Provides-Extra: huggingface
Requires-Dist: datasets>=3.0.0; extra == "huggingface"
Requires-Dist: transformers>=4.0.0; extra == "huggingface"
Provides-Extra: langsmith
Requires-Dist: langsmith>=0.1.86; extra == "langsmith"
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == "bigquery"
Requires-Dist: google-auth>=2.0.0; extra == "bigquery"
Provides-Extra: svgbench
Requires-Dist: selenium>=4.0.0; extra == "svgbench"
Provides-Extra: pydantic
Requires-Dist: pydantic-ai>=1.0.2; extra == "pydantic"
Provides-Extra: supabase
Requires-Dist: supabase>=2.18.1; extra == "supabase"
Provides-Extra: chinook
Requires-Dist: psycopg2-binary>=2.9.10; extra == "chinook"
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3.0; extra == "langchain"
Provides-Extra: braintrust
Requires-Dist: braintrust[otel]; extra == "braintrust"
Provides-Extra: langgraph
Requires-Dist: langgraph>=0.6.7; extra == "langgraph"
Requires-Dist: langchain-core>=0.3.75; extra == "langgraph"
Provides-Extra: langgraph-tools
Requires-Dist: langgraph>=0.6.7; extra == "langgraph-tools"
Requires-Dist: langchain>=0.3.0; extra == "langgraph-tools"
Requires-Dist: langchain-fireworks>=0.3.0; extra == "langgraph-tools"
Provides-Extra: proxy
Requires-Dist: redis>=5.0.0; extra == "proxy"
Requires-Dist: langfuse>=2.0.0; extra == "proxy"
Requires-Dist: uuid6>=2025.0.0; extra == "proxy"
Dynamic: license-file

# Eval Protocol (EP)

[![PyPI - Version](https://img.shields.io/pypi/v/eval-protocol)](https://pypi.org/project/eval-protocol/)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/eval-protocol/python-sdk)

**Stop guessing which AI model to use. Build a data-driven model leaderboard.**

With hundreds of models and configs, you need objective data to choose the right one for your use case. EP helps you evaluate real traces, compare models, and visualize results locally.

## 🚀 Features

- **Pytest authoring**: `@evaluation_test` decorator to configure evaluations
- **Robust rollouts**: Handles flaky LLM APIs and parallel execution
- **Integrations**: Works with Langfuse, LangSmith, Braintrust, Responses API
- **Agent support**: LangGraph and Pydantic AI
- **MCP RL envs**: Build reinforcement learning environments with MCP
- **Built-in benchmarks**: AIME, tau-bench
- **LLM judge**: Stack-rank models using pairwise Arena-Hard-Auto
- **Local UI**: Pivot/table views for real-time analysis
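
The pairwise-judge idea above (stack-ranking models from head-to-head comparisons) can be sketched with a toy win-rate ranking. This is illustrative only; `rank_by_win_rate`, the tuple format, and the model names are hypothetical, not EP's actual API:

```python
from collections import defaultdict

# Each comparison records which of two models a judge preferred:
# (model_a, model_b, winner).
comparisons = [
    ("gpt-4.1", "gpt-oss-120b", "gpt-4.1"),
    ("gpt-4.1", "gpt-oss-120b", "gpt-oss-120b"),
    ("gpt-4.1", "gpt-oss-120b", "gpt-4.1"),
]


def rank_by_win_rate(comparisons):
    wins, games = defaultdict(int), defaultdict(int)
    for model_a, model_b, winner in comparisons:
        games[model_a] += 1
        games[model_b] += 1
        wins[winner] += 1
    # Sort models by the fraction of pairwise comparisons won, best first.
    return sorted(games, key=lambda m: wins[m] / games[m], reverse=True)


print(rank_by_win_rate(comparisons))  # → ['gpt-4.1', 'gpt-oss-120b']
```

Real pairwise protocols like Arena-Hard-Auto add position debiasing and confidence intervals on top of this basic idea.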

## ⚡ Quickstart (no labels needed)

Install with your tracing platform extras and set API keys:

```bash
pip install 'eval-protocol[langfuse]'

# Model API keys (set what you need)
export OPENAI_API_KEY=...
export FIREWORKS_API_KEY=...
export GEMINI_API_KEY=...

# Platform keys
export LANGFUSE_PUBLIC_KEY=...
export LANGFUSE_SECRET_KEY=...
export LANGFUSE_HOST=https://your-deployment.com  # optional
```

Minimal evaluation using the built-in AHA judge:

```python
from datetime import datetime, timezone
import pytest

from eval_protocol import (
    evaluation_test,
    aha_judge,
    EvaluationRow,
    SingleTurnRolloutProcessor,
    DynamicDataLoader,
    create_langfuse_adapter,
)


def langfuse_data_generator() -> list[EvaluationRow]:
    adapter = create_langfuse_adapter()
    return adapter.get_evaluation_rows(
        to_timestamp=datetime.now(timezone.utc),
        limit=20,
        sample_size=5,
    )


@pytest.mark.parametrize(
    "completion_params",
    [
        {"model": "openai/gpt-4.1"},
        {"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b"},
    ],
)
@evaluation_test(
    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
    return await aha_judge(row)
```

Run it:

```bash
pytest -q -s
```

The pytest output prints links to a locally served leaderboard and row-level trace views (pivot/table) at `http://localhost:8000`.
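
The `limit`/`sample_size` pair in the quickstart fetches a window of recent traces and then down-samples it. A minimal stdlib sketch of that pattern, assuming those semantics (the `sample_recent` helper is hypothetical, not part of EP):

```python
import random


def sample_recent(traces, limit=20, sample_size=5, seed=0):
    # Take the most recent `limit` traces, then draw a reproducible
    # random sample of `sample_size` rows from that window.
    recent = traces[:limit]
    rng = random.Random(seed)
    return rng.sample(recent, min(sample_size, len(recent)))


rows = sample_recent([f"trace-{i}" for i in range(100)])
print(len(rows))  # → 5
```

Down-sampling like this keeps judge costs bounded while still drawing from fresh production traffic.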

## Installation

This library requires Python >= 3.10.

### pip

```bash
pip install eval-protocol
```

### uv (recommended)

```bash
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Add to your project
uv add eval-protocol
```

## 📚 Resources

- **[Documentation](https://evalprotocol.io)** – Guides and API reference
- **[Discord](https://discord.com/channels/1137072072808472616/1400975572405850155)** – Community
- **[GitHub](https://github.com/eval-protocol/python-sdk)** – Source and examples

## License

[MIT](LICENSE)
