Metadata-Version: 2.3
Name: multimedeval
Version: 1.0.0
Summary: A Python tool to evaluate the performance of VLM on the medical domain.
License: MIT
Keywords: evaluation,medical,vlm
Author: Corentin Royer
Requires-Python: >=3.9,<3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: Pillow (>=10.2.0)
Requires-Dist: bert_score
Requires-Dist: datasets (>=2.16)
Requires-Dist: dotmap (>=1.3.30,<2.0.0)
Requires-Dist: gdown
Requires-Dist: h5py
Requires-Dist: jsonpickle
Requires-Dist: kaggle
Requires-Dist: medmnist
Requires-Dist: nibabel
Requires-Dist: nltk
Requires-Dist: protobuf
Requires-Dist: pydicom
Requires-Dist: scikit-learn (==1.3.2)
Requires-Dist: spacy (>=3.6.0,<3.8.4)
Requires-Dist: statsmodels
Requires-Dist: torch
Requires-Dist: torchmetrics
Requires-Dist: transformers
Requires-Dist: types-requests
Project-URL: Documentation, https://github.com/corentin-ryr/MultiMedEval
Project-URL: Homepage, https://github.com/corentin-ryr/MultiMedEval
Project-URL: Repository, https://github.com/corentin-ryr/MultiMedEval
Description-Content-Type: text/markdown

# MultiMedEval

MultiMedEval is a library to evaluate the performance of Vision-Language Models (VLMs) on medical-domain tasks. The goal is to provide a set of benchmarks with a unified evaluation scheme to facilitate the development and comparison of medical VLMs.
We include 24 tasks representing 10 different imaging modalities as well as some text-only tasks.

![tests workflow](https://github.com/corentin-ryr/MultiMedEval/actions/workflows/python-tests.yml/badge.svg) ![PyPI - Version](https://img.shields.io/pypi/v/multimedeval) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/multimedeval) ![GitHub License](https://img.shields.io/github/license/corentin-ryr/MultiMedEval)

## Tasks

<details>
  <summary>Question Answering</summary>

| Task     | Description                                            | Modality         | Size |
| -------- | ------------------------------------------------------ | ---------------- | ---- |
| MedQA    | Multiple choice questions on general medical knowledge | General medicine | 1273 |
| PubMedQA | Yes/no/maybe questions based on PubMed paper abstracts | General medicine | 500  |
| MedMCQA  | Multiple choice questions on general medical knowledge | General medicine | 4183 |

</details>

<br/>

<details>
  <summary>Visual Question Answering</summary>

| Task     | Description                              | Modality  | Size |
| -------- | ---------------------------------------- | --------- | ---- |
| VQA-RAD  | Open ended questions on radiology images | X-ray     | 451  |
| Path-VQA | Open ended questions on pathology images | Pathology | 6719 |
| SLAKE    | Open ended questions on radiology images | X-ray     | 1061 |

</details>

<br/>

<details>
  <summary>Report Comparison</summary>

| Task                       | Description                                                                       | Modality    | Size  |
| -------------------------- | --------------------------------------------------------------------------------- | ----------- | ----- |
| MIMIC-CXR-ReportGeneration | Generation of the findings section of radiology reports based on the radiology images | Chest X-ray | 2347  |
| MIMIC-III                  | Summarization of radiology reports                                                | Text        | 13054 |

</details>

<br/>

<details>
  <summary>Natural Language Inference</summary>

| Task   | Description                                      | Modality         | Size |
| ------ | ------------------------------------------------ | ---------------- | ---- |
| MedNLI | Natural Language Inference on medical sentences. | General medicine | 1422 |

</details>

<br/>

<details>
  <summary>Image Classification</summary>

| Task                          | Description                                                                                                   | Modality      | Size  |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------- | ------------- | ----- |
| MIMIC-CXR-ImageClassification | Classification of radiology images into 5 diseases                                                            | Chest X-ray   | 5159  |
| VinDr-Mammo                   | Classification of mammography images into 5 BIRADS levels                                                     | Mammography   | 429   |
| Pad-UFES-20                   | Classification of skin lesion images into 7 diseases                                                          | Dermatology   | 2298  |
| CBIS-DDSM-Mass                | Classification of masses in mammography images into "benign", "malignant" or "benign without callback"        | Mammography   | 378   |
| CBIS-DDSM-Calcification       | Classification of calcification in mammography images into "benign", "malignant" or "benign without callback" | Mammography   | 326   |
| MNIST-Oct                     | Image classification of optical coherence tomography scans of the retina                                      | OCT           | 1000  |
| MNIST-Path                    | Image classification of pathology images                                                                       | Pathology     | 7180  |
| MNIST-Blood                   | Image classification of blood cells seen through a microscope                                                  | Microscopy    | 3421  |
| MNIST-Breast                  | Image classification of mammography images                                                                     | Mammography   | 156   |
| MNIST-Derma                   | Image classification of skin defect images                                                                     | Dermatology   | 2005  |
| MNIST-OrganC                  | Image classification of abdominal CT scans                                                                     | CT            | 8216  |
| MNIST-OrganS                  | Image classification of abdominal CT scans                                                                     | CT            | 8827  |
| MNIST-Pneumonia               | Image classification of chest X-rays                                                                           | X-ray         | 624   |
| MNIST-Retina                  | Image classification of the retina taken with a fundus camera                                                  | Fundus camera | 400   |
| MNIST-Tissue                  | Image classification of kidney cortex seen through a microscope                                                | Microscopy    | 12820 |

</details>

<br/>

<p align="center">
    <img src="figures/sankey.png" alt="Sankey graph">
    <br>
    <em>Representation of the modalities, tasks and datasets in MultiMedEval</em>
</p>

## Setup

To install the library, use `pip`:

```console
pip install multimedeval
```

To run the benchmark on your model, you first need to create an instance of the `MultiMedEval` class.

```python
from multimedeval import MultiMedEval, SetupParams, EvalParams
from multimedeval.utils import BatcherInput, BatcherOutput

engine = MultiMedEval()
```

You then need to call the `setup` function of the `engine`. This will download the datasets if needed and prepare them for evaluation. You can specify where to store the data and which datasets you want to download.

```python
setupParams = SetupParams(medqa_dir="data/")
tasksReady = engine.setup(setup_params=setupParams)
```

Here we initialize the `SetupParams` dataclass with only the path for the MedQA dataset. If you omit the directory for a dataset, it will be skipped during evaluation. During the setup process, the script will need a PhysioNet username and password to download "VinDr-Mammo", "MIMIC-CXR", and "MIMIC-III". You also need to set up Kaggle on your machine before running the setup, as the CBIS-DDSM dataset is hosted on Kaggle. At the end of the setup process, you will see a summary of which tasks are ready and which failed, and the function will return this summary as a dictionary.
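
As a fuller sketch, here is a setup covering several datasets at once (the paths and credentials below are placeholders, not real values):

```python
setupParams = SetupParams(
    medqa_dir="data/",
    vqa_rad_dir="data/",
    mimic_cxr_dir="data/mimic_cxr/",     # requires PhysioNet credentials
    physionet_username="your_username",  # placeholder
    physionet_password="your_password",  # placeholder
)
tasksReady = engine.setup(setup_params=setupParams)
print(tasksReady)  # dictionary summarizing which tasks are ready
```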

## Usage

### Implement the Batcher

The user must implement one Callable: `batcher`. It takes a batch of inputs and must return a list of answers.
Each input is an instance of the `BatcherInput` dataclass, containing the following fields:

- `conversation`: a prompt in the form of a Hugging Face style conversation between a user and an assistant.
- `images`: a list of Pillow images. The number of images matches the number of `<img>` tokens in the prompt, and the images appear in the same order as the tokens.
- `segmentation_masks`: (optional) a list of segmentation masks. The number of masks matches the number of `<seg>` tokens in the prompt, and the masks appear in the same order as the tokens.

```python
[
    BatcherInput(
        conversation = 
          [
              {"role": "user", "content": "This is a question with an image <img>."},
              {"role": "assistant", "content": "This is the answer."},
              {"role": "user", "content": "This is a question with an image <img>."},
          ],
        images = [PIL.Image(), PIL.Image()],
        segmentation_masks = [PIL.Image(), PIL.Image()]
    ),
    BatcherInput(
        conversation =
          [
              {"role": "user", "content": "This is a question without images."},
              {"role": "assistant", "content": "This is the answer."},
              {"role": "user", "content": "This is a question without images."},
          ],
        images = [],
        segmentation_masks = []
    ),

]
```

Here is an example of a `batcher` without any logic:

```python
from typing import List

def batcher(prompts: List[BatcherInput]) -> List[BatcherOutput]:
    return [BatcherOutput(text="Answer") for _ in prompts]
```

A function is the simplest example of a Callable, but the batcher can also be implemented as a Callable class (i.e. a class implementing the `__call__` method). Doing it this way allows you to initialize the model in the `__init__` method of the class. We give an example for the Mistral model (a language-only model).

```python
from typing import List

from transformers import AutoModelForCausalLM, AutoTokenizer
from multimedeval.utils import BatcherInput, BatcherOutput


class batcherMistral:
    def __init__(self) -> None:
        self.model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __call__(self, prompts: List[BatcherInput]) -> List[BatcherOutput]:
        # Render each conversation as a prompt string with the model's chat template.
        texts = [self.tokenizer.apply_chat_template(messages.conversation, tokenize=False) for messages in prompts]
        model_inputs = self.tokenizer(texts, padding="max_length", truncation=True, max_length=1024, return_tensors="pt")

        generated_ids = self.model.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=self.tokenizer.pad_token_id)

        # Keep only the newly generated tokens, dropping the prompt.
        generated_ids = generated_ids[:, model_inputs["input_ids"].shape[1]:]

        answers = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        return [BatcherOutput(text=answer) for answer in answers]
```
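
An instance of this class can then be passed to `engine.eval` in place of a plain function, e.g. `engine.eval(["MedQA"], batcherMistral(), eval_params=evalParams)`.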

### Run the benchmark

To run the benchmark, call the `eval` method of the `MultiMedEval` class with the list of tasks to benchmark, the batcher to evaluate, and the evaluation parameters. If the list is empty, all tasks will be benchmarked.

```python
evalParams = EvalParams(batch_size=128)
results = engine.eval(["MedQA", "VQA-RAD"], batcher, eval_params=evalParams)
```

## MultiMedEval parameters

The `SetupParams` class takes a path for each dataset:

- medqa_dir: will be used in Hugging Face's `load_dataset` as `cache_dir`
- pubmedqa_dir: will be used in Hugging Face's `load_dataset` as `cache_dir`
- medmcqa_dir: will be used in Hugging Face's `load_dataset` as `cache_dir`
- vqa_rad_dir: will be used in Hugging Face's `load_dataset` as `cache_dir`
- path_vqa_dir: will be used in Hugging Face's `load_dataset` as `cache_dir`
- slake_dir: the dataset is currently hosted on Google Drive, which can be an issue on some systems.
- mimic_iii_dir: path for the MIMIC-III dataset (PhysioNet).
- mednli_dir: will be used in Hugging Face's `load_dataset` as `cache_dir`
- mimic_cxr_dir: path for the MIMIC-CXR dataset (PhysioNet).
- vindr_mammo_dir: path for the VinDr-Mammo dataset (PhysioNet).
- pad_ufes_20_dir
- cbis_ddsm_dir: dataset hosted on Kaggle. Kaggle must be set up on the system (see [this](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication))
- mnist_oct_dir
- mnist_path_dir
- mnist_blood_dir
- mnist_breast_dir
- mnist_derma_dir
- mnist_organc_dir
- mnist_organs_dir
- mnist_pneumonia_dir
- mnist_retina_dir
- mnist_tissue_dir
- chexbert_dir: path for the CheXBert model checkpoint
- physionet_username: PhysioNet username used to download MIMIC and VinDr-Mammo
- physionet_password: password for the PhysioNet account

The `EvalParams` class takes the following arguments:

- batch_size: The size of the batches sent to the user's batcher Callable.
- run_name: The name to use for the folder where the output will be stored.
- fewshot: A boolean indicating whether the evaluation is few-shot.
- num_workers: The number of workers for the dataloader.
- device: The device to run the evaluation on.
- tensorBoardWriter: The TensorBoard writer to use for logging.
- tensorboardStep: The global step used when logging to TensorBoard.
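
For example, a sketch of a fuller configuration (the run name is a placeholder, and the exact value expected for `device` may differ on your system):

```python
from torch.utils.tensorboard import SummaryWriter

evalParams = EvalParams(
    batch_size=64,
    run_name="mistral_benchmark",       # placeholder folder name for outputs
    num_workers=4,
    device="cuda",                      # device to run the evaluation on
    tensorBoardWriter=SummaryWriter(),  # optional TensorBoard logging
    tensorboardStep=0,
)
```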

## Additional tasks

To add a new task to the list of already implemented ones, create a folder named `MultiMedEvalAdditionalDatasets` and a subfolder with the name of your dataset.
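
The layout might look like the following sketch (the dataset, JSON, and image file names are placeholders):

```console
MultiMedEvalAdditionalDatasets/
└── MyVQADataset/
    ├── dataset.json
    ├── image1.png
    └── image2.png
```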

Inside your dataset folder, create a JSON file that follows this template for a VQA dataset:

```json
{
  "taskType": "VQA",
  "modality": "Radiology",
  "samples": [
    {
      "question": "Question 1",
      "answer": "Answer 1",
      "images": ["image1.png", "image2.png"]
    },
    { "question": "Question 2", "answer": "Answer 2", "images": ["image1.png"] }
  ]
}
```

And for a QA dataset:

```json
{
  "taskType": "QA",
  "modality": "Pathology",
  "samples": [
    {
      "question": "Question 1",
      "answer": "Answer 1",
      "options": ["Option 1", "Option 2"],
      "images": ["image1.png", "image2.png"]
    },
    {
      "question": "Question 2",
      "answer": "Answer 2",
      "options": ["Option 1", "Option 2"],
      "images": ["image1.png"]
    }
  ]
}
```

Note that in both cases the `images` key is optional. If the `taskType` is VQA, the metrics computed will be BLEU-1, accuracy for closed and open questions, overall recall, recall for open questions, and F1. For the QA `taskType`, the tool reports accuracy (by comparing the answer to every option using BLEU).
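
Conceptually, the BLEU-based option matching for QA tasks can be pictured as in the following sketch (a simplified illustration using NLTK, not the library's exact implementation):

```python
# Simplified sketch: pick the option closest to the model's answer by BLEU-1.
# This illustrates the idea only; it is not the library's exact code.
from nltk.translate.bleu_score import sentence_bleu

def match_option(answer: str, options: list[str]) -> str:
    scores = [
        sentence_bleu([option.lower().split()], answer.lower().split(), weights=(1.0,))
        for option in options
    ]
    return options[scores.index(max(scores))]

print(match_option("The answer is option 1", ["Option 1", "Option 2"]))
```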

## Reference

```bibtex
@misc{royer2024multimedeval,
      title={MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models},
      author={Corentin Royer and Bjoern Menze and Anjany Sekuboyina},
      year={2024},
      eprint={2402.09262},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

