Metadata-Version: 2.1
Name: clustertk
Version: 0.16.2
Summary: A comprehensive toolkit for cluster analysis with full pipeline support
Home-page: https://github.com/alexeiveselov92/clustertk
Author: Aleksey Veselov
Author-email: Aleksey Veselov <alexei.veselov92@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/alexeiveselov92/clustertk
Project-URL: Documentation, https://clustertk.readthedocs.io
Project-URL: Repository, https://github.com/alexeiveselov92/clustertk
Project-URL: Bug Tracker, https://github.com/alexeiveselov92/clustertk/issues
Keywords: clustering,machine-learning,data-analysis,pipeline,kmeans,pca,data-science
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Provides-Extra: all
Provides-Extra: dev
Provides-Extra: extras
Provides-Extra: viz

# ClusterTK

[![PyPI version](https://badge.fury.io/py/clustertk.svg)](https://pypi.org/project/clustertk/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://github.com/alexeiveselov92/clustertk/workflows/Tests/badge.svg)](https://github.com/alexeiveselov92/clustertk/actions/workflows/tests.yml)
[![codecov](https://codecov.io/gh/alexeiveselov92/clustertk/branch/main/graph/badge.svg)](https://codecov.io/gh/alexeiveselov92/clustertk)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**A comprehensive Python toolkit for cluster analysis with full pipeline support.**

ClusterTK provides a complete, sklearn-style pipeline for clustering: from raw data preprocessing to cluster interpretation and export. Perfect for data analysts who want powerful clustering without writing hundreds of lines of code.

## Features

- 🔄 **Complete Pipeline** - One-line solution from raw data to insights
- 📊 **Multiple Algorithms** - K-Means, GMM, Hierarchical, DBSCAN, HDBSCAN
- 🎯 **Auto-Optimization** - Automatic optimal cluster number selection
- 🧮 **Smart Dimensionality Reduction** - PCA/UMAP/None with algorithm-specific auto-mode
- 🎯 **Feature Selection** - Find optimal feature subsets for better clustering (**NEW in v0.16.0!**)
- 🎨 **Rich Visualization** - Beautiful plots (optional dependency)
- 📁 **Export & Reports** - CSV, JSON, HTML reports with embedded plots
- 💾 **Save/Load** - Persist and reload fitted pipelines
- 🔍 **Interpretation** - Profiling, naming, and feature importance analysis

## Quick Start

### Installation

```bash
# Core functionality
pip install clustertk

# With visualization extras (quotes keep shells like zsh from globbing the brackets)
pip install "clustertk[viz]"
```

### Basic Usage

```python
import pandas as pd
from clustertk import ClusterAnalysisPipeline

# Load data
df = pd.read_csv('your_data.csv')

# Create and fit pipeline
pipeline = ClusterAnalysisPipeline(
    dim_reduction='auto',      # Smart selection (PCA/UMAP/None based on algorithm)
    handle_missing='median',
    correlation_threshold=0.85,
    n_clusters=None,           # Auto-detect optimal number
    verbose=True
)

pipeline.fit(df, feature_columns=['feature1', 'feature2', 'feature3'])

# Get results
labels = pipeline.labels_
profiles = pipeline.cluster_profiles_
metrics = pipeline.metrics_

print(f"Found {pipeline.n_clusters_} clusters")
print(f"Silhouette score: {metrics['silhouette']:.3f}")

# Export
pipeline.export_results('results.csv')
pipeline.export_report('report.html')

# Visualize (requires clustertk[viz])
pipeline.plot_clusters_2d()
pipeline.plot_cluster_heatmap()
```

## Documentation

- **[Installation Guide](docs/installation.md)** - Detailed installation instructions
- **[Quick Start](docs/quickstart.md)** - Get started in 5 minutes
- **[User Guide](docs/user_guide/README.md)** - Complete component documentation
  - [Preprocessing](docs/user_guide/preprocessing.md)
  - [Feature Selection](docs/user_guide/feature_selection.md)
  - [Clustering](docs/user_guide/clustering.md)
  - [Evaluation](docs/user_guide/evaluation.md)
  - [Interpretation](docs/user_guide/interpretation.md) - Profiles, naming, feature importance
  - [Visualization](docs/user_guide/visualization.md)
  - [Export](docs/user_guide/export.md)
- **[Examples](docs/examples.md)** - Real-world use cases
- **[FAQ](docs/faq.md)** - Common questions

## Pipeline Workflow

```
Raw Data → Preprocessing → Feature Selection → Dimensionality Reduction
→ Clustering → Evaluation → Interpretation → Export
```

Each step is configurable through pipeline parameters or can be run independently.

## Key Capabilities

### Preprocessing
- Missing value handling (median/mean/drop)
- Univariate outlier handling (winsorize/robust/clip/remove)
- Multivariate outlier detection (IsolationForest/LOF/EllipticEnvelope)
- Automatic scaling (robust/standard/minmax)
- Skewness transformation
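
The preprocessing steps above correspond to standard scikit-learn/scipy building blocks. As a rough sketch of what median imputation, winsorizing, and robust scaling do on a toy array (illustrative only, not ClusterTK's actual internals):

```python
import numpy as np
from scipy.stats.mstats import winsorize
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 210.0],
              [100.0, 190.0]])  # one missing value, one extreme outlier

X = SimpleImputer(strategy="median").fit_transform(X)  # NaN -> column median
X = winsorize(X, limits=[0.25, 0.25], axis=0)          # clip extreme tails per column
X = RobustScaler().fit_transform(np.asarray(X))        # center/scale by median and IQR
```

In the pipeline, the equivalent choices are made via parameters like `handle_missing='median'`.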

### Dimensionality Reduction (v0.15.0+)
- **Auto-mode** - Smart selection based on algorithm + data
- **PCA** - Linear, preserves global structure (best for K-Means/GMM)
- **UMAP** - Non-linear, preserves local density (best for HDBSCAN/DBSCAN)
- **None** - Work in original feature space (low-dimensional data)

| Algorithm | Features | Auto Selection |
|-----------|----------|----------------|
| K-Means/GMM | <50 | None |
| K-Means/GMM | ≥50 | PCA |
| HDBSCAN/DBSCAN | <30 | None |
| HDBSCAN/DBSCAN | ≥30 | UMAP |
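
Read as a rule, the table above amounts to the following (illustrative pseudologic mirroring the table, not the library's exact source):

```python
def auto_dim_reduction(algorithm: str, n_features: int) -> str:
    """Mirror the auto-selection table: centroid-based methods fall back
    to PCA at high dimensionality, density-based methods to UMAP."""
    if algorithm in ("kmeans", "gmm"):
        return "pca" if n_features >= 50 else "none"
    if algorithm in ("hdbscan", "dbscan"):
        return "umap" if n_features >= 30 else "none"
    return "none"
```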

### Clustering Algorithms
- **K-Means** - Fast, spherical clusters
- **GMM** - Probabilistic, elliptical clusters
- **Hierarchical** - Dendrograms, hierarchical structure
- **DBSCAN** - Density-based, arbitrary shapes
- **HDBSCAN** - Advanced density-based, varying densities (v0.8.0+)
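
All of these (except HDBSCAN, which needs the `hdbscan` package) have direct scikit-learn counterparts; a minimal side-by-side on synthetic blob data, assuming made-up tuning values:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

labels = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X),
    "gmm": GaussianMixture(n_components=3, random_state=42).fit_predict(X),
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "dbscan": DBSCAN(eps=1.5, min_samples=5).fit_predict(X),  # -1 marks noise
}
```

ClusterTK wraps these behind the single `clustering_algorithm` parameter shown in the examples below.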

### Evaluation & Interpretation
- Silhouette score, Calinski-Harabasz, Davies-Bouldin metrics
- Automatic optimal k selection
- Cluster profiling and automatic naming
- **Feature importance analysis** (v0.9.0+)
  - Permutation importance
  - Feature contribution (variance ratio)
  - SHAP values (optional)
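
The three internal metrics listed above are the standard scikit-learn ones; a quick sketch of how they are computed and which direction is "better" (toy data, not ClusterTK code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, range [-1, 1]
ch = calinski_harabasz_score(X, labels)  # higher is better, unbounded
db = davies_bouldin_score(X, labels)     # lower is better, >= 0
```

Automatic k selection typically amounts to sweeping `n_clusters` over a range and picking the value that optimizes such scores.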

### Export & Reports
- CSV export (data + labels)
- JSON export (metadata + profiles)
- HTML reports with embedded visualizations
- Pipeline serialization (save/load)

## Examples

### HDBSCAN with UMAP (v0.15.0+)

```python
# Perfect for high-dimensional density-based clustering
pipeline = ClusterAnalysisPipeline(
    dim_reduction='umap',         # Preserves local density
    umap_n_components=10,          # NOT 2! For clustering, not viz
    clustering_algorithm='hdbscan'
)

pipeline.fit(high_dim_data)

# UMAP preserves density → HDBSCAN finds real clusters!
print(f"Found {pipeline.n_clusters_} clusters")
print(f"Noise ratio: {pipeline.cluster_profiles_.noise_ratio_:.1%}")
```

**Why UMAP for HDBSCAN?**
- PCA destroys local density → HDBSCAN finds only noise
- UMAP preserves local structure → HDBSCAN works correctly
- Auto-mode selects UMAP for HDBSCAN/DBSCAN when features >30

**Important:** Use `umap_n_components` in the 10–20 range for clustering, NOT 2–3 (those low values are for visualization only)!

### Feature Selection for Better Clustering (v0.16.0+)

```python
# Problem: You have 30 features, but not all are useful for clustering
# More features ≠ better clustering (curse of dimensionality)

# Step 1: Fit on all features
pipeline = ClusterAnalysisPipeline(dim_reduction='pca')
pipeline.fit(df)  # 30 features → Silhouette: 0.42

# Step 2: Find which features matter most
importance = pipeline.get_pca_feature_importance()
print(importance.head(10))  # Top 10 features by PCA loadings

# Step 3: Try refitting with top 10 features
comparison = pipeline.refit_with_top_features(
    n_features=10,
    importance_method='permutation',  # Best for clustering quality
    compare_metrics=True,
    update_pipeline=False  # Just compare, don't update yet
)

# Step 4: If metrics improved, update pipeline
if comparison['metrics_improved']:
    print(f"Improvement: {comparison['weighted_improvement']:+.1%}")
    pipeline.refit_with_top_features(n_features=10, update_pipeline=True)
    # New silhouette: 0.58 (+38% improvement!)
```

**Why Feature Selection?**
- Irrelevant features dilute clustering signal (noise)
- PCA can't fix bad features, only compress them
- 10 good features > 30 mixed features

**Three Importance Methods:**
- `'permutation'` - Best for clustering quality (default)
- `'contribution'` - Variance ratio analysis
- `'pca'` - PCA loadings (only if dim_reduction='pca')
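
The idea behind permutation importance for clustering can be sketched in a few lines: shuffle one feature at a time and measure how much the silhouette score drops under the fixed labels. A large drop means the feature carries clustering signal; a pure-noise feature barely moves the score. (Illustrative sketch, not ClusterTK's implementation.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
X = np.hstack([X, rng.normal(size=(200, 1))])  # append one pure-noise feature

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
base = silhouette_score(X, labels)

importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break this feature's link to the labels
    importance.append(base - silhouette_score(Xp, labels))
```

Here the noise column (index 2) should score near zero while the two informative columns score high, which is exactly the signal `refit_with_top_features` exploits.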

### Feature Importance Analysis

```python
# Understand which features drive your clustering
results = pipeline.analyze_feature_importance(method='all')

# View permutation importance
print(results['permutation'].head())

# View feature contribution (variance ratio)
print(results['contribution'].head())

# Use top features for focused analysis
top_features = results['permutation'].head(5)['feature'].tolist()
```

### Algorithm Comparison

```python
# Compare multiple algorithms automatically
results = pipeline.compare_algorithms(
    X=df,
    feature_columns=['feature1', 'feature2', 'feature3'],
    algorithms=['kmeans', 'gmm', 'hierarchical', 'dbscan'],
    n_clusters_range=(2, 8)
)

print(results['comparison'])  # DataFrame with metrics
print(f"Best algorithm: {results['best_algorithm']}")

# Visualize comparison
pipeline.plot_algorithm_comparison(results)
```

### Customer Segmentation

```python
pipeline = ClusterAnalysisPipeline(
    n_clusters=None,  # Auto-detect
    auto_name_clusters=True
)

pipeline.fit(customers_df,
            feature_columns=['age', 'income', 'purchases'],
            category_mapping={
                'demographics': ['age', 'income'],
                'behavior': ['purchases']
            })

pipeline.export_report('customer_segments.html')
```

### Anomaly Detection

```python
pipeline = ClusterAnalysisPipeline(
    clustering_algorithm='dbscan'
)

pipeline.fit(transactions_df)
anomalies = transactions_df[pipeline.labels_ == -1]
```
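
The `labels_ == -1` convention comes from DBSCAN itself, which assigns -1 to points that fall outside every dense region. A plain scikit-learn equivalent of the idea, on synthetic data with two obvious outliers:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
inliers = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])  # far from the dense region
X = np.vstack([inliers, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
anomalies = X[labels == -1]  # DBSCAN marks unclustered points with -1
```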

More examples: [docs/examples.md](docs/examples.md)

## Requirements

- Python 3.8+
- numpy >= 1.20.0
- pandas >= 1.3.0
- scikit-learn >= 1.0.0
- scipy >= 1.7.0
- joblib >= 1.0.0

Optional (for visualization):
- matplotlib >= 3.4.0
- seaborn >= 0.11.0

## Contributing

Contributions are welcome! Please check:
- [GitHub Issues](https://github.com/alexeiveselov92/clustertk/issues) - Report bugs
- [GitHub Discussions](https://github.com/alexeiveselov92/clustertk/discussions) - Questions

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Citation

If you use ClusterTK in your research, please cite:

```bibtex
@software{clustertk2024,
  author = {Veselov, Aleksey},
  title = {ClusterTK: A Comprehensive Python Toolkit for Cluster Analysis},
  year = {2024},
  url = {https://github.com/alexeiveselov92/clustertk}
}
```

## Links

- **PyPI**: https://pypi.org/project/clustertk/
- **GitHub**: https://github.com/alexeiveselov92/clustertk
- **Documentation**: [docs/](docs/)
- **Author**: Aleksey Veselov (alexei.veselov92@gmail.com)

---

Made with ❤️ for the data science community
