Metadata-Version: 2.4
Name: flexeval
Version: 0.15.0
Summary: 
License-File: LICENSE
Author: ryokan-ri
Author-email: ryokan.ri@sbintuitions.co.jp
Requires-Python: >=3.10,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Provides-Extra: vllm
Provides-Extra: wandb
Requires-Dist: datasets (>=2.14.6,<3.0.0)
Requires-Dist: evaluate (>=0.4.1,<0.5.0)
Requires-Dist: fuzzywuzzy (>=0.18.0,<0.19.0)
Requires-Dist: google-api-python-client (>=2.131.0,<3.0.0)
Requires-Dist: jinja2 (>=3.1.2,<4.0.0)
Requires-Dist: jiwer (>=3.0.4,<4.0.0)
Requires-Dist: jsonargparse[jsonnet] (>=4.26.1,<5.0.0)
Requires-Dist: litellm (>=1.52.9,<2.0.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: math-verify[antlr4-13-2] (>=0.7.0,<0.8.0)
Requires-Dist: openai (>=1.52.2,<2.0.0)
Requires-Dist: peft (>=0.10.0,<0.11.0)
Requires-Dist: pyarrow (==16.1.0)
Requires-Dist: python-levenshtein (>=0.23.0,<0.24.0)
Requires-Dist: rouge (>=1.0.1,<2.0.0)
Requires-Dist: sacrebleu[ja] (>=2.4.1,<3.0.0)
Requires-Dist: scikit-learn (==1.6.1)
Requires-Dist: scipy (>=1.13.0,<2.0.0)
Requires-Dist: smart-open (>=7.1.0,<8.0.0)
Requires-Dist: sudachipy (>=0.6.10)
Requires-Dist: tiktoken (>=0.9.0,<0.10.0)
Requires-Dist: transformers[ja,sentencepiece,torch] (>=4.34.1,<5.0.0)
Requires-Dist: vllm (==0.10.2) ; extra == "vllm"
Requires-Dist: wandb (>=0.17.2,<0.18.0) ; extra == "wandb"
Description-Content-Type: text/markdown

# FlexEval

![logo](docs/assets/logo.png)

**Flexible evaluation tool for language models. Easy to extend, highly customizable!**

<h4 align="center">
    <p>
        <b>English</b> |
        <a href="https://github.com/sbintuitions/flexeval/blob/main/README_ja.md">日本語</a>
    </p>
</h4>

With FlexEval, you can evaluate language models with:

* Zero/few-shot in-context learning tasks
* Open-ended text-generation benchmarks such as [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) with automatic evaluation using GPT-4
* Log-probability-based multiple-choice tasks 
* Computing perplexity of text data

For more use cases, see the [documentation](https://sbintuitions.github.io/flexeval/).


## Key Features

* **Flexibility**: `flexeval` is flexible in terms of the evaluation setup and the language model to be evaluated.
* **Modularity**: The core components of `flexeval` are easily extensible and replaceable.
* **Clarity**: The results of evaluation are clear and all the details are saved.
* **Reproducibility**: Evaluations run with `flexeval` are reproducible, since configurations and results can be saved and loaded.

## Installation

```bash
pip install flexeval
```

## Quick Start

The following minimal example evaluates the Hugging Face model `sbintuitions/sarashina2.2-0.5b` on the `commonsense_qa` task.

### Run Command

```bash
flexeval_lm \
  --language_model HuggingFaceLM \
  --language_model.model "sbintuitions/sarashina2.2-0.5b" \
  --eval_setup "commonsense_qa" \
  --save_dir "results/commonsense_qa"
```

### Output

```
...
2025-09-03 16:22:58.434 | INFO     | flexeval.core.evaluate_generation:evaluate_generation:92 - {'exact_match': 0.3185913185913186, 'finish_reason_ratio-stop': 1.0, 'avg_output_length': 9.095004095004095, 'max_output_length': 69, 'min_output_length': 2}
...
```

The results saved in the directory given by `--save_dir` contain:

* `config.json`: The configuration of the evaluation, which can be used to replicate the evaluation.
* `metrics.json`: The evaluation metrics.
* `outputs.jsonl`: The outputs of the language model, together with instance-level metrics.
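
If you want to inspect these files programmatically, the following is a minimal sketch using only the Python standard library. The file names come from the list above; the assumptions are that `metrics.json` contains a single JSON object and that `outputs.jsonl` holds one JSON object per line. `load_results` is a hypothetical helper written for this example, not part of the `flexeval` API.

```python
import json
from pathlib import Path


def load_results(save_dir):
    """Read metrics.json and outputs.jsonl from a save directory.

    Hypothetical helper for illustration; not provided by flexeval.
    """
    save_dir = Path(save_dir)

    # Assumption: metrics.json is a single JSON object of aggregate metrics.
    with open(save_dir / "metrics.json", encoding="utf-8") as f:
        metrics = json.load(f)

    # Assumption: outputs.jsonl is JSON Lines, one object per instance.
    outputs = []
    with open(save_dir / "outputs.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                outputs.append(json.loads(line))

    return metrics, outputs
```

For the run above, `metrics, outputs = load_results("results/commonsense_qa")` would return the aggregate metrics and a list with one entry per evaluated instance. The exact keys inside each entry depend on the evaluation setup, so check a few records before relying on specific fields.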

You can customize the evaluation by specifying command-line arguments or configuration files.
Besides [Transformers](https://github.com/huggingface/transformers) models, you can also evaluate models via [OpenAI ChatGPT](https://openai.com/index/openai-api/) and [vLLM](https://github.com/vllm-project/vllm), and support for other models can be readily added.

## Next Steps

* Run `flexeval_presets` to list the off-the-shelf presets available besides `commonsense_qa`. You can find the details in the [Preset Configs](https://sbintuitions.github.io/flexeval/preset_configs/) section.
* See [Getting Started](https://sbintuitions.github.io/flexeval/getting_started/) for tutorial examples covering other kinds of tasks.
* See the [Configuration Guide](https://sbintuitions.github.io/flexeval/configuration_guide/) to set up your evaluation.

