# CleanFrames

CleanFrames is a Python library for cleaning and summarizing folders of image frames using embedding models and clustering. It removes duplicate and near-duplicate frames, caches embeddings and reports to speed up subsequent runs, and saves the cleaned and removed images alongside the original dataset.

## Key Features

- Processes folders of image frames instead of videos.
- Supports multiple embedding models to represent frames.
- Offers several clustering methods to group similar frames.
- Caches embeddings and reports to optimize performance.
- Saves cleaned and removed images beside the original dataset.
- Provides visualization tools to inspect clusters and frame pairs.
- Generates text-only console reports summarizing cleaning results.

## Installation

To install CleanFrames, clone the repository and install the required dependencies:

```bash
git clone <repository-url>
cd cleanframes
pip install -r requirements.txt
```

## Usage

### Basic Example

```python
from cleanframes import CleanFrame

# Initialize with folder path, embedding model, clustering method, and caching enabled
cf = CleanFrame(
    path='path/to/frames_folder',
    model='clip-ViT-B-32',
    cluster='kmeans',
    cache=True,
    verbose=True
)

# Run the full cleaning pipeline: embedding, clustering, cleaning
cf.run()

# Generate a text-only console report of the cleaning results
cf.report()

# Visualize clusters of frames
cf.visualize_clusters()
```

### Optimized Workflow Example

```python
from cleanframes import CleanFrame

# Initialize with a larger CLIP model and DBSCAN clustering
cf = CleanFrame(
    path='path/to/frames_folder',
    model='clip-ViT-L-14',
    cluster='dbscan',
    cache=True,
    verbose=True
)

# Run the cleaning process
cf.run()

# Print cleaning report
cf.report()

# Visualize the resulting clusters
cf.visualize_clusters()
```

## Caching and Outputs

- With `cache=True`, embeddings and cleaning reports are written to a cache folder so that reruns can reuse them.
- Cleaned and removed images are saved beside the original frames in the dataset folder, allowing easy inspection and further use.
- The caching mechanism avoids redundant computations, improving efficiency when processing large datasets.
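
The caching idea can be illustrated with a minimal sketch. This is not CleanFrames' actual implementation: `cached_embedding`, `file_digest`, the JSON file layout, and the `embed_fn` callable are all hypothetical names chosen for illustration. The sketch keys each cached embedding by a hash of the image bytes, so an unchanged frame is never embedded twice.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes, used as the cache key."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cached_embedding(image_path: Path, cache_dir: Path, embed_fn):
    """Return the embedding for image_path, computing it only on a cache miss.

    embed_fn is a hypothetical callable mapping a path to a list of floats.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(image_path)}.json"
    if cache_file.exists():
        # Cache hit: reuse the stored vector instead of running the model.
        return json.loads(cache_file.read_text())
    embedding = embed_fn(image_path)
    cache_file.write_text(json.dumps(embedding))
    return embedding
```

Keying on file content rather than file name means a renamed frame still hits the cache, while an edited frame is re-embedded.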

## Supported Embedding Models

CleanFrames supports multiple embedding models for frame representation, including but not limited to:

- CLIP models such as `clip-ViT-B-32` and `clip-ViT-L-14`
- Additional models can be integrated as needed.
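
To show why embeddings are useful for finding near-duplicates, here is a small sketch that compares embedding vectors by cosine similarity. It assumes the embeddings are already available as plain lists of floats; the function names and the `0.95` threshold are illustrative assumptions, not part of the CleanFrames API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, threshold=0.95):
    """Return index pairs (i, j) with i < j whose similarity is >= threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The pairwise loop is quadratic in the number of frames, which is one reason a clustering step is attractive for larger folders.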

## Clustering Methods

Available clustering algorithms include:

- KMeans clustering
- DBSCAN clustering
- Other clustering methods can be added or customized.
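
Once frames have cluster labels, one plausible way to clean a cluster is to keep only the frame closest to its centroid. The sketch below assumes this strategy; the source does not state how CleanFrames chooses which frames to keep, and `select_representatives` is a hypothetical helper. It also follows the scikit-learn DBSCAN convention that the label `-1` marks noise points, which are kept as unique frames.

```python
from collections import defaultdict


def select_representatives(embeddings, labels):
    """Return (kept, removed) index lists, keeping one frame per cluster."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    kept, removed = [], []
    for label, members in clusters.items():
        if label == -1:
            # Noise points belong to no cluster, so treat each one as unique.
            kept.extend(members)
            continue
        dim = len(embeddings[members[0]])
        centroid = [
            sum(embeddings[i][d] for i in members) / len(members)
            for d in range(dim)
        ]

        def sq_dist(i):
            # Squared Euclidean distance from frame i to the cluster centroid.
            return sum((embeddings[i][d] - centroid[d]) ** 2 for d in range(dim))

        best = min(members, key=sq_dist)
        kept.append(best)
        removed.extend(i for i in members if i != best)
    return sorted(kept), sorted(removed)
```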

## Visualization

CleanFrames provides visualization tools, such as `visualize_clusters()`, for inspecting clustering results and pairs of similar frames. Use them to verify the cleaning quality and to understand how frames were grouped.

## Reporting

After cleaning, `report()` prints a concise text-only console report summarizing:

- Number of frames processed
- Number of frames removed
- Number of frames retained

Together, these counts show how much of the dataset was redundant and how effective the cleaning was.
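
As a rough illustration of such a summary, the sketch below formats the three counts as plain text. The layout, the percentage column, and the `format_report` name are assumptions made for this example; the actual output of `report()` may differ.

```python
def format_report(processed: int, removed: int) -> str:
    """Build a text-only summary from the processed and removed frame counts."""
    retained = processed - removed
    # Guard against division by zero for an empty folder.
    pct_removed = (removed / processed * 100) if processed else 0.0
    lines = [
        "Cleaning report",
        f"Frames processed: {processed}",
        f"Frames removed:   {removed} ({pct_removed:.1f}%)",
        f"Frames retained:  {retained}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(processed=200, removed=50))
```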

---

For more detailed information and advanced usage, please refer to the source code and examples provided in the repository.
