Metadata-Version: 2.4
Name: smartdownsample
Version: 0.3.0
Summary: Smart image downsampling for image classification datasets
Author: peteraddax
License: MIT
Project-URL: Homepage, https://github.com/PetervanLunteren/smartdownsample
Project-URL: Repository, https://github.com/PetervanLunteren/smartdownsample
Project-URL: Issues, https://github.com/PetervanLunteren/smartdownsample/issues
Project-URL: Documentation, https://github.com/PetervanLunteren/smartdownsample#readme
Keywords: image,downsampling,camera-trap,machine-learning,computer-vision,diversity,deduplication
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: Pillow>=8.0.0
Requires-Dist: imagehash>=4.2.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: natsort>=8.0.0
Requires-Dist: matplotlib>=3.0.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: flake8>=4.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Dynamic: license-file

# smartdownsample

**Fast, simple image downsampling that just works**

SmartDownsample selects images from large collections in seconds, not hours. A single function, `select_distinct`, takes about the same time whether you select 100 or 23,000 images from a set of 24,000.

## Installation

```bash
pip install smartdownsample
```

## Features

- ⚡ **Always fast** - Finishes in seconds for any selection ratio
- 🎯 **Smart bucketing** - Better than random, faster than complex algorithms
- 📊 **Scales linearly** - About 30 seconds for 24,000 images
- 🔧 **Dead simple** - One function, `select_distinct`
- 🎲 **Reproducible** - Set seed for consistent results

## Usage

```python
from smartdownsample import select_distinct

# Select 100 images from 24,000 - takes seconds
selected = select_distinct(
    image_paths=my_24k_images,
    target_count=100
)

# Select 23,000 images from 24,000 - also takes seconds!
selected = select_distinct(
    image_paths=my_24k_images,
    target_count=23000
)

# It's that simple.
print(f"Selected {len(selected)} images")
```

## How It Works

1. **Hash images** - Quick perceptual hashing with DHash (4 parallel workers by default)
2. **Create buckets** - Group similar images together
3. **Sample evenly** - Take images from each bucket for diversity

Result: a more diverse selection than random sampling, without the complexity.
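
The bucketing and sampling steps can be sketched in plain Python. This is an illustration only, not the package's implementation: the helper names `bucket_by_prefix` and `sample_evenly`, the use of the top hash bits as the bucket key, and the round-robin draw are all assumptions made for the example. It takes precomputed 64-bit integer hashes as input, so no image files are needed to run it.

```python
import random
from collections import defaultdict


def bucket_by_prefix(hashes, prefix_bits=8, total_bits=64):
    """Group item indices by the top `prefix_bits` of each integer hash.

    One pass over the input, so grouping is O(n). (Hypothetical bucket key.)
    """
    buckets = defaultdict(list)
    shift = total_bits - prefix_bits
    for index, value in enumerate(hashes):
        buckets[value >> shift].append(index)
    return buckets


def sample_evenly(buckets, target_count, seed=42):
    """Draw round-robin from shuffled buckets until target_count is reached."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    pools = [list(members) for _, members in sorted(buckets.items())]
    for pool in pools:
        rng.shuffle(pool)

    if target_count > sum(len(pool) for pool in pools):
        raise ValueError("target_count exceeds the number of items")

    selected = []
    while len(selected) < target_count:
        for pool in pools:
            if pool and len(selected) < target_count:
                selected.append(pool.pop())
    return sorted(selected)


# Six near-identical items share one prefix, two items share another.
hashes = [(0x00 << 56) | i for i in range(6)] + [(0xFF << 56) | i for i in range(2)]
buckets = bucket_by_prefix(hashes)
picked = sample_evenly(buckets, target_count=4)
print(picked)  # two indices from each bucket
```

Drawing one item per bucket in turn keeps small groups represented: in the example, both items from the two-member bucket are picked even though that bucket holds only a quarter of the data.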

## Performance

| Task | Time |
|------|------|
| 100 from 1,000 | <5 sec |
| 900 from 1,000 | <5 sec |
| 1,000 from 24,000 | ~30 sec |
| 23,000 from 24,000 | ~30 sec |
| Any ratio | Fast ✓ |

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `image_paths` | Required | List of image file paths (str or Path objects) |
| `target_count` | Required | Exact number of images to select |
| `n_workers` | `4` | Number of parallel workers (returns diminish above 4) |
| `hash_size` | `8` | Hash size (8 is fast and good enough) |
| `random_seed` | `42` | Random seed for reproducible results |
| `show_progress` | `True` | Whether to display progress bars |

## Why It's Fast

- **Fixed algorithm** - No switching between methods
- **Simple hashing** - DHash only compares neighboring pixels, so it is faster than PHash, which needs a DCT
- **Smart bucketing** - O(n) grouping instead of O(n²) comparisons
- **Parallel processing** - But capped at 4 workers (diminishing returns above that)
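
To show why DHash is cheap, here is a minimal pure-Python sketch of a difference hash. It is not the code this package runs (the package depends on `imagehash` for hashing), and the function names `dhash_bits` and `hamming` are made up for the example. It assumes the image has already been converted to grayscale and resized to a grid with N rows and N+1 columns.

```python
def dhash_bits(gray_rows):
    """Difference hash of a grayscale grid with N rows and N+1 columns.

    Each output bit records whether a pixel is brighter than its left
    neighbor, so the whole hash is N*N comparisons and no transform.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if right > left else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# A tiny 2x3 grid stands in for a resized image (hash size 2).
grid = [[10, 20, 5],
        [7, 7, 9]]
brighter = [[value + 30 for value in row] for row in grid]  # same scene, brighter
mirrored = [list(reversed(row)) for row in grid]            # different layout

print(format(dhash_bits(grid), "04b"))      # 1001
print(format(dhash_bits(brighter), "04b"))  # 1001
print(hamming(dhash_bits(grid), dhash_bits(mirrored)))  # 1
```

Because only the direction of brightness changes is stored, a uniform brightness shift leaves the hash unchanged, while a different layout flips bits. Hashes that differ in few bits can then be grouped without comparing every pair of images.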

## License

MIT License – see LICENSE file.
