A Python package implementing the clustering algorithm proposed in the paper  
**"An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance"**,  
accepted in the *INFORMS Journal on Data Science*.
The preprint is available on [arXiv:2407.12100](https://arxiv.org/abs/2407.12100).  
This link will be updated once the final published version becomes available.



The package can be installed using:

```bash
pip install --index-url https://pypi.org/simple/ --no-deps distclust==0.0.4
```

## Main function
```python
def cluster_distributions(
    dist_file,
    reg=0.5,
    n_clusters=None,
    calculate_barycenter=False,
    stop_threshold=10 ** -9,
    num_of_iterations=1000,
    plt_dendrogram=True,
    path_dendrogram=None,
    sup_barycenter=100,
    t0=0.005,
    theta=0.005,
): ...
```

## Description
This function performs hierarchical (agglomerative) clustering of empirical probability distributions using the *regularized (entropic) Wasserstein* distance.  
It takes a JSON-formatted string that encodes a list of distributions, computes all pairwise regularized Wasserstein distances, and then performs agglomerative clustering.

- Returns one JSON string with each distribution and its assigned cluster.
- If `calculate_barycenter=True`, it also computes barycenters of each cluster and returns a second JSON string with the barycenters.
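
To make the roles of `reg`, `stop_threshold`, and `num_of_iterations` concrete, here is a minimal pure-Python sketch of Sinkhorn iterations for an entropic-regularized optimal transport cost. It is not the package's implementation. The function name `sinkhorn_cost`, the 1-D samples, the uniform weights, and the squared Euclidean ground cost are all assumptions made for illustration.

```python
import math


def sinkhorn_cost(x, y, reg=0.5, stop_threshold=1e-9, num_of_iterations=1000):
    """Entropic-regularized OT cost between two 1-D samples (hypothetical helper).

    Assumptions of this sketch: uniform point masses and a squared Euclidean
    ground cost. distclust itself may use different choices.
    """
    n, m = len(x), len(y)
    a = [1.0 / n] * n  # uniform mass on each point of x
    b = [1.0 / m] * m  # uniform mass on each point of y

    # Ground cost matrix C and Gibbs kernel K = exp(-C / reg).
    C = [[(xi - yj) ** 2 for yj in y] for xi in x]
    K = [[math.exp(-c / reg) for c in row] for row in C]

    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_of_iterations):
        # Alternating scaling updates push the plan toward both marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        # Column marginals are exact after updating v; measure the row error.
        err = sum(
            abs(u[i] * sum(K[i][j] * v[j] for j in range(m)) - a[i])
            for i in range(n)
        )
        if err < stop_threshold:
            break

    # Transport plan P = diag(u) K diag(v); return the transport cost <P, C>.
    return sum(
        u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m)
    )


near = sinkhorn_cost([0.0, 0.1, 0.2], [0.0, 0.1, 0.2])
far = sinkhorn_cost([0.0, 0.1, 0.2], [1.0, 1.1, 1.2])
print(near < far)
```

A larger `reg` blurs the transport plan and usually speeds up convergence, while a smaller `reg` approximates the unregularized Wasserstein cost more closely but needs more iterations and can underflow in the kernel.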

---

### Function Parameters

- **`dist_file`** *(str)*:  
  A JSON-formatted string containing a dictionary of distributions.  
  Each key in the dictionary is a **distribution number**, mapped to another dictionary with:
  - `"id"`: The identifier of the distribution.  
  - `"data_points"`: A list of tuples representing the data points (in the JSON string, each tuple is encoded as an array).  

  Example format: [`JSON_test.txt`](https://github.com/mohammadmgh78/Agglomerative_Clustering_Distribution/blob/main/distclust/JSON_test.txt)

- **`reg`** *(float)*:  
  Entropic regularization parameter for the Wasserstein distance. Must be positive.

- **`n_clusters`** *(int or None)*:  
  Number of clusters to form. If `None`, the optimal number is chosen using the silhouette index.

- **`calculate_barycenter`** *(bool)*:  
  If `True`, compute a regularized Wasserstein barycenter for each cluster.  
  If `False`, only clustering results are returned.

- **`stop_threshold`** *(float)*:  
  Convergence threshold for the Sinkhorn iterations.

- **`num_of_iterations`** *(int)*:  
  Maximum number of Sinkhorn iterations for each OT distance computation.

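- **`plt_dendrogram`** *(bool)*:  
  If `True`, generate and display the dendrogram plot.  
  If `path_dendrogram` is also provided, the plot is saved to that path.

- **`path_dendrogram`** *(str or None)*:  
  File path where the dendrogram is saved as a PNG. If `None`, the plot is not saved.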

- **`sup_barycenter`** *(int)*:  
  Number of support points to initialize for barycenter computation.

- **`t0`** *(float)*:  
  Base step size for the barycenter probability vector (`a`) update.

- **`theta`** *(float)*:  
  Relaxation parameter for the barycenter support (`X`) update.
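
The `dist_file` format described above can be assembled with the standard `json` module. The following is a minimal sketch: the helper name `build_dist_file`, the sample data, and the choice to number distributions from 0 are assumptions, so the result should be checked against the linked example file before use.

```python
import json


def build_dist_file(samples):
    """Encode {identifier: list of data-point tuples} as a dist_file string.

    Hypothetical helper: numbering the distributions from 0 is an assumption
    of this sketch, not something the package documentation specifies.
    """
    dist_dict = {
        str(number): {"id": identifier, "data_points": points}
        for number, (identifier, points) in enumerate(samples.items())
    }
    # json.dumps writes each tuple as a JSON array.
    return json.dumps(dist_dict)


# Made-up one-dimensional outputs for three simulated systems.
samples = {
    "system_A": [(1.0,), (1.5,), (2.0,)],
    "system_B": [(1.1,), (1.4,), (2.2,)],
    "system_C": [(7.0,), (7.5,), (8.0,)],
}
dist_file = build_dist_file(samples)
print(dist_file)
```

The resulting string is what would be passed as the first argument to `cluster_distributions`.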

---

### Returns

If `calculate_barycenter=False`:
- **`json_clusters`** *(str)*: JSON with each distribution's ID, real data points, and assigned cluster label.

If `calculate_barycenter=True`:
- **`json_clusters`** *(str)*: JSON with each distribution's ID, real data points, and assigned cluster label.
- **`json_barycenters`** *(str)*: JSON with each cluster's barycenter, including unnormalized supports and probability masses.

If `plt_dendrogram=True`:
- Displays the dendrogram plot.  
- If `path_dendrogram` is provided, saves the dendrogram as a PNG file to that path.

## Other Functions in `distclust`

The package also provides the following helper functions, which may be useful on their own:

1. **`density_calc`** – Compute empirical probability masses.  
2. **`density_calc_list`** – Batch probability mass computation.  
3. **`fill_ot_distance`** – Compute and store regularized Wasserstein distances between all systems.  
4. **`plot_dendrogram`** – Dendrogram visualization.  
5. **`silhouette_score_agglomerative`** – Choose number of clusters.  
6. **`find_barycenter`** – Compute Wasserstein barycenter.  
7. **`calculate_OT_cost_bary`** – OT computation for barycenter step.  
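
When `n_clusters=None`, the number of clusters is chosen with the silhouette index. The sketch below shows how that index can be computed from a precomputed distance matrix and a candidate labeling. It is a generic illustration with a hypothetical function name, `silhouette_index`, and made-up distances; it does not call or reproduce `silhouette_score_agglomerative`, whose signature is not documented here.

```python
def silhouette_index(dist, labels):
    """Mean silhouette width for a precomputed distance matrix (illustrative)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Made-up pairwise distances: items 0 and 1 are close, as are items 2 and 3.
dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
candidates = [[0, 0, 1, 1], [0, 1, 0, 1]]
best = max(candidates, key=lambda labels: silhouette_index(dist, labels))
print(best)
```

Selecting the number of clusters then amounts to cutting the dendrogram at several candidate levels and keeping the labeling with the highest index.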
