Metadata-Version: 2.4
Name: patx
Version: 0.2.6
Summary: Pattern eXtraction for Time Series and Spatial Data
Author-email: Jonas Wolber <jonascw@web.de>
Maintainer-email: Jonas Wolber <jonascw@web.de>
License-Expression: MIT
Project-URL: Repository, https://github.com/Prgrmmrjns/patX
Keywords: time-series,spatial-data,feature-engineering,pattern-extraction,machine-learning,optimization,polynomial-patterns
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: optuna>=3.0.0
Requires-Dist: lightgbm>=3.3.0
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: pyarrow>=10.0.0
Dynamic: license-file

# PatX - Pattern eXtraction for Time Series Feature Engineering

[![PyPI version](https://badge.fury.io/py/patx.svg)](https://badge.fury.io/py/patx)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

PatX is a Python package for extracting polynomial patterns from time series data to create features for machine learning models. 
It uses Optuna to automatically search for the patterns most predictive of your target variable, scoring matches with a choice of similarity metrics and either fixed-position or sliding-window matching.

## Installation

```bash
pip install patx
```

## Quick Start

Copy and paste this complete example to get started immediately:

```python
import numpy as np
import pandas as pd
from patx import feature_extraction, load_remc_data
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the included REMC dataset with two input series (H3K4me3, H3K4me1)
data = load_remc_data(series=("H3K4me3", "H3K4me1"))
input_series = data['X_list']  # list of arrays, one per input series
y = data['y']
series_names = data['series_names']

print(f"Loaded {len(input_series)} input series: {series_names}")
print(f"Samples: {len(y)}, time points per input series: {input_series[0].shape[1]}")  # (1841, 40)

# Split data
indices = np.arange(len(y))
train_indices, test_indices = train_test_split(
    indices, test_size=0.2, random_state=42, stratify=y
)

# Use a single input series as a plain pandas DataFrame (a 1D numpy array also works)
input_series_train = pd.DataFrame(input_series[0][train_indices])
input_series_test = pd.DataFrame(input_series[0][test_indices])
y_train, y_test = pd.Series(y[train_indices]), y[test_indices]

# Extract patterns and train model
result = feature_extraction(
    input_series_train=input_series_train, 
    y_train=y_train, 
    input_series_test=input_series_test, 
    n_trials=100, 
    show_progress=False
)

# Get results
trained_model = result['model']
patterns = result['patterns']
test_probabilities = trained_model.predict_proba(result['test_features'])[:, 1]  # positive-class probabilities

# Check performance
auc_score = roc_auc_score(y_test, test_probabilities)
print("\nResults:")
print(f"Found {len(patterns)} patterns from single input series")
print(f"Test AUC: {auc_score:.4f}")
print(f"Model features shape: {result['train_features'].shape}")  # (1177, 4)
```

### Multiple Input Series Example

For multiple input series data:

```python
# Use multiple input series as list of DataFrames
input_series_train_multiple = [pd.DataFrame(X[train_indices]) for X in input_series]
input_series_test_multiple = [pd.DataFrame(X[test_indices]) for X in input_series]

# Extract patterns from multiple input series
multiple_result = feature_extraction(
    input_series_train=input_series_train_multiple, 
    y_train=y_train, 
    input_series_test=input_series_test_multiple, 
    n_trials=50, 
    show_progress=False
)

# Check multiple input series results
multiple_probs = multiple_result['model'].predict_proba(multiple_result['test_features'])[:, 1]
multiple_auc = roc_auc_score(y_test, multiple_probs)
print(f"Multiple input series: {len(multiple_result['patterns'])} patterns, AUC={multiple_auc:.4f}")
print(f"Model features shape: {multiple_result['train_features'].shape}")  # (1177, 6)
```


### Input Data Types

PatX works with simple Pandas DataFrames or 1D numpy arrays:

```python
import pandas as pd
import numpy as np
from patx import feature_extraction

# Option 1: pandas DataFrame (recommended)
input_series_train = pd.DataFrame(your_data)
input_series_test = pd.DataFrame(your_test_data)

# Option 2: 1D numpy array (also works)
# input_series_train = np.array(your_data)
# input_series_test = np.array(your_test_data)

result = feature_extraction(
    input_series_train=input_series_train, 
    y_train=y_train, 
    input_series_test=input_series_test, 
    n_trials=100
)

# Check results
print(f"Found {len(result['patterns'])} patterns")
print(f"Pattern starts: {result['pattern_starts']}")
print(f"Pattern widths: {result['pattern_widths']}")
print(f"Similarity metrics: {result['pattern_similarity_metrics']}")
print(f"Static flags: {result['pattern_static_flags']}")
```

### Pattern Generation & Similarity Metrics

PatX generates candidate polynomial patterns and scores how well each one matches the data using one of four similarity metrics:

- **RMSE**: Scale-dependent, sensitive to outliers - good for financial/sensor data
- **R²**: Scale-invariant, measures explained variance - good for shape patterns
- **MAE**: Scale-dependent, robust to outliers - good for noisy data  
- **Cosine**: Scale-invariant, measures shape similarity - good for directional patterns

Patterns can be matched at fixed positions (static) or anywhere in the series (sliding window). Optuna automatically optimizes all parameters for your data.
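
To illustrate the sliding-window idea, the sketch below (not PatX's internal implementation; `sliding_rmse`, the toy series, and the polynomial coefficients are all made up for this example) slides a degree-3 polynomial pattern over a series and keeps the best RMSE match:

```python
import numpy as np

# Illustrative sketch only (not PatX's internals): slide a pattern over
# a series and return the best (lowest) RMSE across all window positions.
def sliding_rmse(series, pattern):
    w = len(pattern)
    return min(
        np.sqrt(np.mean((series[i:i + w] - pattern) ** 2))
        for i in range(len(series) - w + 1)
    )

series = np.sin(np.linspace(0, 2 * np.pi, 40))   # toy input series
x = np.linspace(-1, 1, 10)
coeffs = [-0.15, 0.0, 1.0, 0.0]                  # hypothetical degree-3 coefficients (highest first)
pattern = np.polyval(coeffs, x)                  # polynomial pattern of width 10

print(f"best RMSE: {sliding_rmse(series, pattern):.4f}")
```

A static pattern would instead evaluate the metric only at its fixed start index, and scale-invariant metrics such as cosine similarity compare shape rather than magnitude.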

## API Reference

### feature_extraction

The main function for extracting patterns from input series data.

**Parameters:**
- `input_series_train`: Training input series data (simple DataFrame or 1D numpy array, or list for multiple input series)
- `y_train`: Training targets (Series or array)
- `input_series_test`: Test input series data (same structure as `input_series_train`)
- `initial_features`: Optional initial features (array or tuple of train/test arrays)
- `model`: Optional model instance (defaults to LightGBM based on task)
- `metric`: Optional; auto-detected (binary→auc, multiclass→accuracy, regression→rmse)
- `polynomial_degree`: Optional degree of polynomial patterns (default: 3)
- `val_size`: Optional validation split ratio (default: 0.2)
- `n_trials`: Maximum number of optimization trials (default: 300)
- `n_jobs`: Number of parallel jobs (default: -1)
- `show_progress`: Show progress bar (default: True)

**Returns:**
A dictionary containing:
- `patterns`: list of pattern arrays (just the pattern values)
- `pattern_starts`: start indices for each pattern
- `pattern_widths`: width of each pattern
- `pattern_series_indices`: which input series each pattern was extracted from
- `pattern_similarity_metrics`: similarity metric used for each pattern ('rmse', 'r2', 'mae', 'cosine')
- `pattern_static_flags`: whether each pattern uses fixed position (True) or sliding window (False)
- `train_features`: training feature matrix for the ML model
- `test_features`: test feature matrix for the ML model
- `model`: the trained model
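
The per-pattern entries are parallel lists, so they can be inspected together. The `result` dict below is a toy stand-in with made-up values so the loop runs standalone; with a real `feature_extraction` result the loop is identical:

```python
import numpy as np

# Toy stand-in for a feature_extraction result (made-up values);
# the keys mirror the return dictionary documented above.
result = {
    'patterns': [np.array([0.1, 0.4, 0.2]), np.array([0.0, -0.3])],
    'pattern_starts': [5, 12],
    'pattern_widths': [3, 2],
    'pattern_series_indices': [0, 1],
    'pattern_similarity_metrics': ['rmse', 'cosine'],
    'pattern_static_flags': [True, False],
}

# Print one summary line per extracted pattern
for i in range(len(result['patterns'])):
    print(
        f"pattern {i}: series={result['pattern_series_indices'][i]}, "
        f"start={result['pattern_starts'][i]}, width={result['pattern_widths'][i]}, "
        f"metric={result['pattern_similarity_metrics'][i]}, "
        f"static={result['pattern_static_flags'][i]}"
    )
```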

### Data

- `load_remc_data(series)`: Load the included REMC epigenomics dataset (multiple input series)

### Custom Models

You can use any model that implements `fit()`, `predict()`, `predict_proba()` (for classification), and `clone()` methods. Here's an example with sklearn:

**Sklearn Classifier Example:**
```python
from sklearn.linear_model import LogisticRegression
from sklearn.base import clone

class SklearnClassifierWrapper:
    def __init__(self, sklearn_model):
        self.sklearn_model = sklearn_model
    
    def fit(self, X_train, y_train, X_val=None, y_val=None):
        self.sklearn_model.fit(X_train, y_train)
        return self
    
    def predict(self, X):
        return self.sklearn_model.predict(X)
    
    def predict_proba(self, X):
        return self.sklearn_model.predict_proba(X)
    
    def clone(self):
        return SklearnClassifierWrapper(clone(self.sklearn_model))

# Use custom model
model = SklearnClassifierWrapper(LogisticRegression())
result = feature_extraction(input_series_train, y_train, input_series_test, model=model)
```

This wrapper works with any sklearn classifier (RandomForest, SVM, etc.).

## Key Features

- **Automatic Pattern Discovery**: Uses Optuna to find optimal polynomial patterns
- **Multiple Similarity Metrics**: RMSE, R², MAE, and Cosine similarity
- **Flexible Pattern Matching**: Fixed position or sliding window search
- **Scale Handling**: Automatic scale-dependent vs scale-invariant optimization
- **Multivariate Support**: Works with single or multiple input time series
- **Robust Optimization**: Handles outliers and noisy data intelligently

## Citation

If you use PatX in your research, please cite:

```bibtex
@software{patx,
  title={PatX: Pattern eXtraction for Time Series Feature Engineering},
  author={Wolber, J.},
  year={2025},
  url={https://github.com/Prgrmmrjns/patX}
}
```
