

<p align="center">
  <a href="LICENSE"><img src="https://img.shields.io/github/license/isathish/LLMEvaluationFramework?style=for-the-badge" alt="License"></a>
  <a href="https://github.com/isathish/LLMEvaluationFramework/actions"><img src="https://img.shields.io/github/actions/workflow/status/isathish/LLMEvaluationFramework/python-app.yml?style=for-the-badge" alt="Build Status"></a>
  <a href="https://pypi.org/project/LLMEvaluationFramework/"><img src="https://img.shields.io/pypi/v/llm-evaluation-framework?style=for-the-badge" alt="PyPI Version"></a>
  <a href="https://pypi.org/project/LLMEvaluationFramework/"><img src="https://img.shields.io/pypi/dm/llm-evaluation-framework?style=for-the-badge" alt="Downloads"></a>
  <a href="https://pypi.org/project/LLMEvaluationFramework/"><img src="https://img.shields.io/badge/PyPI-LLMEvaluationFramework-blue?style=for-the-badge&logo=pypi" alt="View on PyPI"></a>
</p>

<h1 align="center">🚀 LLMEvaluationFramework</h1>

<p align="center"><b>A Modern, Modular, and Extensible Framework for Evaluating Large Language Models</b></p>

<p align="center">
  <img src="https://img.shields.io/badge/Status-Active-success?style=for-the-badge">
  <img src="https://img.shields.io/badge/Maintained-Yes-brightgreen?style=for-the-badge">
  <img src="https://img.shields.io/badge/Contributions-Welcome-blue?style=for-the-badge">
</p>


<p align="center">
  <img src="https://img.shields.io/badge/LLM-Evaluation-blueviolet?style=for-the-badge&logo=python" alt="Framework Badge">
  <img src="https://img.shields.io/badge/AI-Powered-orange?style=for-the-badge&logo=artstation" alt="AI Badge">
</p>

<p align="center">
  <a href="https://isathish.github.io/llmevaluationframework/">
    <img src="https://img.shields.io/badge/Docs-Online-success?style=for-the-badge&logo=readthedocs" alt="Docs Badge">
  </a>
  <a href="https://pypi.org/project/LLMEvaluationFramework/">
    <img src="https://img.shields.io/pypi/v/llm-evaluation-framework?style=for-the-badge&logo=python" alt="PyPI Version">
  </a>
  <a href="https://opensource.org/licenses/MIT">
    <img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License">
  </a>
  <a href="https://github.com/isathish/LLMEvaluationFramework/stargazers">
    <img src="https://img.shields.io/github/stars/isathish/LLMEvaluationFramework?style=for-the-badge&logo=github" alt="Stars">
  </a>
</p>


<p align="center">
  <i>Advanced Python framework for evaluating, testing, and benchmarking Large Language Models (LLMs)</i><br>
  <a href="https://isathish.github.io/llmevaluationframework/"><b>📚 View Full Documentation</b></a>
</p>

---
## ✨ Highlights

| Feature | Description |
|---------|-------------|
| 🚀 **Fast** | Async inference engine for high-throughput evaluations |
| 🧩 **Modular** | Plug-and-play architecture for engines, datasets, and scoring |
| 📊 **Insightful** | Rich scoring strategies and analytics |
| 💾 **Persistent** | Store results in JSON or databases |
| 🛠 **Developer-Friendly** | CLI tools, API, and full test coverage |


## 🌟 At a Glance

<p align="center">
  <img src="https://img.shields.io/badge/Status-Active-success?style=flat-square">
  <img src="https://img.shields.io/badge/Maintained-Yes-brightgreen?style=flat-square">
  <img src="https://img.shields.io/badge/Contributions-Welcome-blue?style=flat-square">
</p>

<div align="center">

**LLMEvaluationFramework** is your **all-in-one** solution for **evaluating, testing, and benchmarking** Large Language Models (LLMs) with style and precision.

</div>

---
## 📖 Overview

> **Beautifully Designed & Developer Friendly** — LLMEvaluationFramework combines **powerful evaluation tools** with a **modern, visually appealing documentation style** to make your LLM benchmarking experience smooth and enjoyable.

**LLMEvaluationFramework** is a **production-grade** toolkit for **evaluating, testing, and benchmarking LLMs**.  
It provides a modular architecture with model inference, automated suggestions, model registry management, and synthetic dataset generation — all in one package.

### Why Choose LLMEvaluationFramework?
- **Comprehensive Evaluation**: Supports multiple evaluation strategies including accuracy, relevance, and custom metrics.
- **Scalable**: Handles both small-scale experiments and large-scale benchmarking with ease.
- **Extensible**: Add new models, datasets, and scoring strategies without modifying core code.
- **Developer-Friendly**: Well-documented APIs, CLI tools, and examples for quick onboarding.
- **Production-Ready**: Built with testing, logging, and persistence in mind.

### Use Cases
- Benchmarking multiple LLMs for research or production.
- Generating synthetic datasets for training or evaluation.
- Automating prompt optimization workflows.
- Integrating LLM evaluation into CI/CD pipelines.

---

## ✨ Key Features

Below is a detailed breakdown of the framework's capabilities:

> A modular, extensible, and testable framework designed for both research and production environments.



| 🚀 Feature | 💡 Description |
|------------|----------------|
| ⚡ **Model Inference Engine** | Evaluate prompts against multiple LLMs with ease |
| 💡 **Auto Suggestion Engine** | Generate intelligent prompt suggestions |
| 📚 **Model Registry** | Manage and register multiple LLM configurations |
| 🧪 **Test Dataset Generator** | Create synthetic datasets for evaluation |
| 🔌 **Extensible** | Easily integrate with new models and datasets |
| ✅ **Testable** | Designed with 100% test coverage in mind |



### 🆕 Latest Additions
- 🚀 **Async Inference Engine** — Concurrent model evaluations for faster benchmarking.
- 📏 **Custom Scoring Strategies** — Plug in your own evaluation metrics.
- 💾 **Persistent Storage** — JSON/DB backends for saving configurations and results.
- 🖥 **CLI Support** — Manage models and run evaluations from the terminal.
- 📜 **Enhanced Logging** — Detailed logs for debugging and performance tracking.
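
To illustrate why concurrent evaluation speeds up benchmarking, here is a minimal sketch of the pattern using only `asyncio`. It does not use the framework's own async engine: `fake_model_call` and `evaluate_concurrently` are hypothetical names, and the model call is a stub standing in for a real network request.

```python
import asyncio


async def fake_model_call(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for a network-bound model call."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"{model_name} answer to: {prompt}"


async def evaluate_concurrently(model_names: list[str], prompt: str) -> dict[str, str]:
    """Send the same prompt to every model at once and map name -> response."""
    tasks = [fake_model_call(name, prompt) for name in model_names]
    # gather() runs the coroutines concurrently and returns results in input order.
    responses = await asyncio.gather(*tasks)
    return dict(zip(model_names, responses))


results = asyncio.run(evaluate_concurrently(["model-a", "model-b"], "What is 2 + 2?"))
print(results)
```

Because the calls wait on I/O at the same time, total latency is close to that of the slowest model rather than the sum of all of them.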

---

## 📦 Installation

<p align="center">
  <img src="https://img.shields.io/badge/Install%20in-2%20Minutes-success?style=for-the-badge">
</p>

The framework can be installed in multiple ways depending on your needs:
- **PyPI**: For stable releases.
- **Source**: For the latest development version.
- **Custom Fork**: Clone and modify for internal use.

<details>
<summary>💻 Click to Expand Installation Steps</summary>

> **Tip:** Use a virtual environment to keep dependencies isolated.


**From PyPI**
```bash
pip install llm-evaluation-framework
```

**From Source**
```bash
git clone https://github.com/isathish/LLMEvaluationFramework.git
cd LLMEvaluationFramework
pip install -e ".[dev]"
```
</details>

---

## 📚 Table of Contents
- [Overview](#-overview)
- [Key Features](#-key-features)
- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [Workflow Overview](#-workflow-overview)
- [Advanced Configuration](#-advanced-configuration)
- [Project Structure](#-project-structure)
- [Documentation](#-documentation)
- [Contributing](#-contributing)
- [License](#-license)

---

## 🚀 Quick Start

<p align="center">
  <img src="https://img.shields.io/badge/Start%20Now-Fast%20Track-blue?style=for-the-badge">
</p>

We now provide **comprehensive runnable examples** in the [`examples/`](examples/) folder covering all major use cases:

| Example Script | Description |
|----------------|-------------|
| `basic_usage.py` | Single model registration, dataset generation, synchronous evaluation, and result printing |
| `advanced_async_usage.py` | Multiple model registration, async evaluation for faster benchmarking, and result comparison |
| `custom_scoring_and_persistence.py` | Custom keyword-based scoring strategy, database persistence, and querying stored results |
| `cli_usage.py` | Full CLI workflow for registering models, generating datasets, running evaluations, and viewing results |
| `dataset_generation_and_analysis.py` | Multi-topic dataset generation, saving to files, loading, and performing basic dataset statistics |
| `model_recommendation.py` | Multi-model evaluation and ranking using the AutoSuggestionEngine |
| `error_handling_and_logging.py` | Configuring logging, handling API errors gracefully, and logging evaluation progress/results |
| `full_pipeline_demo.py` | Complete workflow combining multiple models, datasets, sync & async evaluations, custom scoring, JSON & DB persistence, recommendations, and logging |

> **Tip:** These examples are a great starting point to understand the framework in action.

This section demonstrates how to get started quickly with minimal setup.

<details>
<summary>📌 Show Quick Start Examples</summary>

> **Pro Tip:** Explore the [Usage Guide](docs/usage.md) for advanced workflows.

---

<details>
<summary>🔍 Model Inference</summary>

```python
from llm_evaluation_framework import ModelInferenceEngine

engine = ModelInferenceEngine(model_name="gpt-4")
result = engine.evaluate("What is the capital of France?")
print(result)
```
</details>
</details>

<details>
<summary>💡 Auto Suggestions</summary>

```python
from llm_evaluation_framework import AutoSuggestionEngine

suggestion_engine = AutoSuggestionEngine(model_name="gpt-4")
suggestions = suggestion_engine.suggest("Write a poem about the ocean.")
print(suggestions)
```
</details>

<details>
<summary>📚 Model Registry</summary>

```python
from llm_evaluation_framework import ModelRegistry

ModelRegistry.register("gpt-4", {"provider": "OpenAI", "max_tokens": 4096})
print(ModelRegistry.list_models())
```
</details>

<details>
<summary>🧪 Test Dataset Generation</summary>

```python
from llm_evaluation_framework import TestDatasetGenerator

generator = TestDatasetGenerator()
dataset = generator.generate(num_samples=5, topic="math problems")
print(dataset)
```
</details>

---

## 📊 Workflow Overview


<p align="center">
  <img src="https://img.shields.io/badge/Workflow-Visualized-blue?style=flat-square">
</p>

The following diagram illustrates the high-level workflow of the framework:

```mermaid
flowchart TD
    A[Input Prompt] --> B[Model Inference Engine]
    B --> C[Scoring Strategies]
    C --> D[Evaluation Results]
    D --> E[Persistent Storage]
```
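
The stages in the diagram can be sketched end to end with the standard library alone. This is an illustrative outline under stated assumptions, not the framework's implementation: `run_inference`, `exact_match`, and `evaluate` are hypothetical helpers, and the "model" is a canned lookup.

```python
import json
import tempfile
from pathlib import Path


def run_inference(prompt: str) -> str:
    """Stub inference step: returns a canned answer instead of calling a model."""
    canned = {"What is the capital of France?": "Paris"}
    return canned.get(prompt, "")


def exact_match(prediction: str, reference: str) -> float:
    """Scoring step: 1.0 when the strings match ignoring case and outer whitespace."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def evaluate(prompt: str, reference: str, out_path: Path) -> dict:
    """Run prompt -> inference -> scoring -> result, then persist the result as JSON."""
    prediction = run_inference(prompt)
    result = {
        "prompt": prompt,
        "prediction": prediction,
        "score": exact_match(prediction, reference),
    }
    out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "result.json"
    result = evaluate("What is the capital of France?", "Paris", out_file)
    stored = json.loads(out_file.read_text(encoding="utf-8"))

print(result["score"])
```

Keeping each stage a separate function mirrors the diagram: any one of them can be swapped out without touching the others.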

---

## ⚙️ Advanced Configuration

Advanced configuration allows you to tailor the framework to your specific needs:
- **Custom Models**: Integrate proprietary or experimental LLMs.
- **Custom Scoring**: Implement domain-specific evaluation metrics.
- **Persistence**: Choose between JSON, database, or cloud storage.
- **CLI Extensions**: Add new commands for automation.
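
As a rough idea of what database-backed persistence involves, the sketch below stores scores in SQLite and queries an aggregate. The table layout and the helpers `save_result` and `average_score` are assumptions made for illustration; the framework's own storage backend may be structured differently.

```python
import sqlite3


def save_result(conn: sqlite3.Connection, model: str, prompt: str, score: float) -> None:
    """Insert one evaluation result (hypothetical schema)."""
    conn.execute(
        "INSERT INTO results (model, prompt, score) VALUES (?, ?, ?)",
        (model, prompt, score),
    )
    conn.commit()


def average_score(conn: sqlite3.Connection, model: str):
    """Return the mean score for a model, or None if it has no stored results."""
    row = conn.execute(
        "SELECT AVG(score) FROM results WHERE model = ?", (model,)
    ).fetchone()
    return row[0]


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (model TEXT, prompt TEXT, score REAL)")
save_result(conn, "model-a", "q1", 1.0)
save_result(conn, "model-a", "q2", 0.5)

avg = average_score(conn, "model-a")
print(avg)
```

Parameterized queries (`?` placeholders) are used so that prompts containing quotes or SQL fragments are stored safely.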

<details>
<summary>⚙️ Click to View Advanced Configuration</summary>

> Extend the framework to suit your needs.

You can customize the framework by:
- Adding new model backends
- Defining custom scoring strategies
- Configuring persistent storage (JSON/DB)
- Extending CLI commands

Example:
```python
from llm_evaluation_framework.evaluation import CustomScoringStrategy

class MyScore(CustomScoringStrategy):
    def score(self, prediction, reference):
        # Placeholder metric: exact match. Replace with your own logic.
        return float(prediction.strip() == reference.strip())
```
</details>
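
For a metric that goes beyond exact match, a keyword-based score is one simple option. The sketch below is self-contained and does not import the framework: `ScoringStrategy` is a local stand-in whose `score(prediction, reference)` signature is assumed to mirror the base class shown above, and `KeywordOverlapScore` is a hypothetical name.

```python
from abc import ABC, abstractmethod


class ScoringStrategy(ABC):
    """Local stand-in for the framework's base class (assumed interface)."""

    @abstractmethod
    def score(self, prediction: str, reference: str) -> float:
        ...


class KeywordOverlapScore(ScoringStrategy):
    """Fraction of reference keywords that also appear in the prediction."""

    def score(self, prediction: str, reference: str) -> float:
        keywords = set(reference.lower().split())
        if not keywords:
            return 0.0  # no keywords to look for
        found = keywords & set(prediction.lower().split())
        return len(found) / len(keywords)


scorer = KeywordOverlapScore()
full = scorer.score("Paris is the capital of France", "capital France Paris")
none = scorer.score("Berlin", "capital France Paris")
print(full, none)
```

Splitting on whitespace is deliberately naive; punctuation handling or stemming would be natural refinements for real data.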

---


## 🆕 Recent Code Improvements

We have recently made the following improvements to the codebase:

- **Added Comprehensive Unit Tests for CLI**:  
  Introduced `llm_evaluation_framework/tests/test_cli.py` to cover all CLI commands (`score`, `save`, `load`, and help display).  
  This significantly improves test coverage for `llm_evaluation_framework/cli.py`.

- **Improved CLI Testability**:  
  Refactored test setup to dynamically load `cli.py` for isolated testing without requiring package installation.

- **Enhanced Code Coverage**:  
  The new tests ensure that CLI functionality is validated end-to-end, improving reliability and maintainability.

These changes make the CLI more robust and ensure that future modifications are less likely to introduce regressions.
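
Loading a module directly from its file path, without installing the package, can be done with `importlib`. The following is a sketch of that general technique rather than the project's actual test setup: `load_module_from_path` is a hypothetical helper, and the tiny `cli.py` written to a temporary directory is a stand-in for the real CLI module.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(module_name: str, file_path: Path):
    """Import a Python source file by path and return the module object."""
    spec = importlib.util.spec_from_file_location(module_name, str(file_path))
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {file_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the file's top-level code
    return module


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in CLI module with a trivial entry point.
    cli_path = Path(tmp) / "cli.py"
    cli_path.write_text(
        "def main(argv):\n"
        "    return 'help' if '--help' in argv else 'ok'\n",
        encoding="utf-8",
    )
    cli = load_module_from_path("standalone_cli", cli_path)
    outcome = cli.main(["--help"])

print(outcome)
```

Because the module is loaded under its own name and never registered in `sys.modules`, each test can load a fresh copy without affecting other imports.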



---

