# MGP-Imputer: Missing Value Imputation with Deep Gaussian Processes

[![PyPI version](https://badge.fury.io/py/mgp-imputer.svg)](https://badge.fury.io/py/mgp-imputer)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A PyTorch-based implementation of Missing Gaussian Processes (MGP) for missing value imputation, wrapped in a user-friendly, scikit-learn-compatible API.

This package allows you to seamlessly integrate Deep Gaussian Process models into your data preprocessing pipelines for robust and uncertainty-aware imputation. It is based on the paper ["Gaussian processes for missing value imputation"](https://www.sciencedirect.com/science/article/pii/S0950705123003532).

## Features

- **Scikit-learn Compatible:** Use `fit`, `predict`, and `fit_transform` methods just like any other scikit-learn transformer.
- **Two Imputation Strategies:**
    - `chained` (Default): Builds a separate GP layer for each feature with missing values, modeling dependencies in a chained fashion (MGP).
    - `holistic`: Builds a single, multi-output Deep GP to model all features simultaneously.
- **Probabilistic Imputation:** Returns both the imputed values and the standard deviation, giving you a measure of uncertainty for each imputed value.
- **GPU Accelerated:** Leverages PyTorch to run on CUDA devices for significant speedups.
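
The per-value standard deviation can be turned into approximate intervals or used to flag low-confidence imputations. The following is a minimal sketch that uses stand-in NumPy arrays rather than real imputer output. It assumes the standard deviation array has the same shape as the data and that a Gaussian approximation is reasonable; `std_threshold` is a hypothetical cutoff chosen only for illustration.

```python
import numpy as np

# Stand-in arrays for illustration only; in practice these would be the
# imputed values and standard deviations returned by the imputer.
rng = np.random.default_rng(0)
X_imputed = rng.normal(loc=5.0, scale=2.0, size=(6, 3))
X_std = rng.uniform(0.1, 1.5, size=(6, 3))
missing_mask = rng.random((6, 3)) < 0.3

# Approximate 95% interval under a Gaussian assumption: mean +/- 1.96 * std
z = 1.96
lower = X_imputed - z * X_std
upper = X_imputed + z * X_std

# Flag imputed entries whose uncertainty exceeds a hypothetical threshold
std_threshold = 1.0
uncertain = missing_mask & (X_std > std_threshold)
print(f"{uncertain.sum()} of {missing_mask.sum()} imputed values exceed std {std_threshold}")
```

Flagged entries could, for example, be excluded from downstream analysis or reviewed manually.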

## Installation

You can install `mgp-imputer` directly from PyPI:

```bash
pip install mgp-imputer
```



## Quick Start
Here's how to use `MGPImputer` to fill in missing values (`np.nan`) in your dataset.
```python
import numpy as np
import pandas as pd
from mgp import MGPImputer

# 1. Create a synthetic dataset with 20% missing values
np.random.seed(42)
n_samples, n_features = 200, 5
X_true = np.random.rand(n_samples, n_features) * 10
X_missing = X_true.copy()
missing_mask = np.random.rand(n_samples, n_features) < 0.2
X_missing[missing_mask] = np.nan

print(f"Created a dataset with {np.sum(missing_mask)} missing values.")

# 2. Initialize the MGPImputer
# Strategies can be 'chained' (default) or 'holistic'
imputer = MGPImputer(
    imputation_strategy='chained',
    n_inducing_points=100,
    n_iterations=1000, # Use more iterations for real data
    learning_rate=0.01,
    batch_size=64,
    verbose=True,
    seed=42
)

# 3. Fit on the data and transform it to get imputed values
# The imputer returns the imputed data and the standard deviation of the predictions
X_imputed, X_std = imputer.fit_transform(X_missing)

# 4. Evaluate the imputation quality
rmse = np.sqrt(np.mean((X_imputed[missing_mask] - X_true[missing_mask])**2))
print("\nImputation complete.")
print(f"RMSE on missing values: {rmse:.4f}")

# The result is a complete numpy array
print("\nImputed data shape:", X_imputed.shape)
print("Number of NaNs in imputed data:", np.isnan(X_imputed).sum())
```
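
To put the RMSE in context, it can help to compare it against a simple baseline. The sketch below rebuilds the same synthetic dataset and computes the RMSE of column-mean imputation using only NumPy. It is an illustrative baseline, not part of the `mgp-imputer` package.

```python
import numpy as np

# Same synthetic setup as in the Quick Start
np.random.seed(42)
n_samples, n_features = 200, 5
X_true = np.random.rand(n_samples, n_features) * 10
X_missing = X_true.copy()
missing_mask = np.random.rand(n_samples, n_features) < 0.2
X_missing[missing_mask] = np.nan

# Column-mean baseline: fill each NaN with the mean of the observed
# values in its column
col_means = np.nanmean(X_missing, axis=0)
X_baseline = np.where(np.isnan(X_missing), col_means, X_missing)

# RMSE on the originally missing entries only
baseline_rmse = np.sqrt(
    np.mean((X_baseline[missing_mask] - X_true[missing_mask]) ** 2)
)
print(f"Mean-imputation RMSE on missing values: {baseline_rmse:.4f}")
```

Because the features in this synthetic dataset are drawn independently, no imputer should be expected to improve much on this baseline here. Datasets with correlated features are a more informative test.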


## Citation
If you use this work in your research, please cite the original paper:

Jafrasteh, B., Hernández-Lobato, D., Lubián-López, S. P., & Benavente-Fernández, I. (2023). Gaussian processes for missing value imputation. Knowledge-Based Systems, 273, 110603.
[Missing GPs](https://www.sciencedirect.com/science/article/pii/S0950705123003532)



## License

This project is licensed under the MIT License.

