Metadata-Version: 2.4
Name: fastbootstrap
Version: 1.8.3
Summary: Fast Python implementation of statistical bootstrap
Project-URL: Homepage, https://github.com/timofeytkachenko/fastbootstrap
Author-email: Timofey Tkachenko <timofey_tkachenko@pm.me>
Keywords: ab-testing,bca-bootstrap,bootstrap,confidence-intervals,data-science,hypothesis-testing,machine-learning,monte-carlo,nonparametric,p-value,percentile-bootstrap,power-analysis,quantile-bootstrap,resampling,sampling-distribution,significance-testing,spotify-bootstrap,statistical-inference,statistical-simulation,statistical-testing,statistics
Requires-Python: >=3.12
Requires-Dist: ipython>=9.4.0
Requires-Dist: joblib>=1.5.1
Requires-Dist: jupyter>=1.1.1
Requires-Dist: matplotlib>=3.10.3
Requires-Dist: numpy>=2.3.1
Requires-Dist: pandas>=2.3.1
Requires-Dist: plotly>=6.2.0
Requires-Dist: scipy>=1.16.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: tqdm>=4.67.1
Description-Content-Type: text/markdown

# FastBootstrap

<div align="center">

**⚡ Fast Python implementation of statistical bootstrap methods**

[![PyPI version](https://badge.fury.io/py/fastbootstrap.svg)](https://badge.fury.io/py/fastbootstrap)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/release/python-3120/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

*High-performance statistical bootstrap with parallel processing and comprehensive method support*

[Installation](#installation) • [Quick Start](#quick-start) • [Examples](#examples) • [Performance](#performance) • [API](#api-reference)

</div>

---

## 🚀 Features

- **Multiple Bootstrap Methods**: Percentile, BCa, Basic, Studentized, Spotify-style, and Poisson bootstrap
- **High Performance**: Parallel processing with joblib, optimized NumPy operations
- **Smart Batch Sizing**: Intelligent auto-optimization for 5-30% performance gains
- **Comprehensive Statistics**: Confidence intervals, p-values, effect sizes, power analysis, quantile-quantile analysis
- **Flexible API**: Unified interface with method auto-selection
- **Rich Visualizations**: Built-in plotting with matplotlib and plotly
- **Production Ready**: Extensive error handling, type hints

## 📦 Installation

```bash
pip install fastbootstrap
```

### 🛠️ Development Setup

```bash
Install uv: https://docs.astral.sh/uv/getting-started/installation/
git clone https://github.com/timofeytkachenko/fastbootstrap.git
cd fastbootstrap
uv sync
source .venv/bin/activate
pre-commit install
```

## 🎯 Quick Start

```python
import numpy as np
import fastbootstrap as fb

# Generate sample data
np.random.seed(42)
control = np.random.normal(100, 15, 1000)      # Control group
treatment = np.random.normal(105, 15, 1000)    # Treatment group (+5% effect)

# Two-sample bootstrap test with smart batch sizing
result = fb.two_sample_bootstrap(
    control, treatment,
    batch_size='smart',  # Auto-optimize performance ✨
    plot=True
)
print(f"P-value: {result['p_value']:.4f}")
print(f"Effect size: {result['statistic_value']:.2f}")
print(f"95% CI: [{result['confidence_interval'][0]:.2f}, {result['confidence_interval'][1]:.2f}]")
```

## 📊 Examples

### One-Sample Bootstrap

Estimate confidence intervals for a single sample statistic:

```python
import fastbootstrap as fb
import numpy as np

# Sample data
sample = np.random.exponential(2, 500)

# Basic percentile bootstrap
result = fb.one_sample_bootstrap(
    sample,
    statistic=np.mean,
    method='percentile',
    bootstrap_conf_level=0.95,
    number_of_bootstrap_samples=10000,
    plot=True
)

print(f"Mean estimate: {result['statistic_value']:.3f}")
print(f"95% CI: [{result['confidence_interval'][0]:.3f}, {result['confidence_interval'][1]:.3f}]")

# Advanced: BCa (Bias-Corrected and Accelerated) bootstrap
bca_result = fb.one_sample_bootstrap(
    sample,
    method='bca',
    statistic=np.median,
    plot=True
)
```
![One-Sample Bootstrap Example](img/onesample.png)



### Two-Sample Comparison

Compare two groups with various statistics:

```python
import fastbootstrap as fb
import numpy as np

# A/B test data
control = np.random.normal(0.25, 0.1, 800)     # 25% conversion rate
treatment = np.random.normal(0.28, 0.1, 800)   # 28% conversion rate

# Test difference in means
result = fb.two_sample_bootstrap(
    control,
    treatment,
    statistic=fb.difference_of_mean,
    number_of_bootstrap_samples=10000,
    plot=True
)

print(f"Difference in conversion rates: {result['statistic_value']:.1%}")
print(f"P-value: {result['p_value']:.4f}")
print(f"Significant: {'Yes' if result['p_value'] < 0.05 else 'No'}")

# Test percentage change
percent_result = fb.two_sample_bootstrap(
    control,
    treatment,
    statistic=fb.percent_change_of_mean
)
print(f"Percentage change: {percent_result['statistic_value']:.1%}")
```

![Two-Sample Bootstrap Example](img/twosample.png)

### Spotify-Style Bootstrap

Fast quantile-based bootstrap using binomial sampling:

```python
import fastbootstrap as fb
import numpy as np

# Revenue data (heavy-tailed distribution)
control_revenue = np.random.lognormal(3, 1, 1000)
treatment_revenue = np.random.lognormal(3.1, 1, 1000)

# Compare medians (50th percentile)
result = fb.spotify_two_sample_bootstrap(
    control_revenue,
    treatment_revenue,
    q1=0.5,  # Median
    q2=0.5,
    plot=True
)

print(f"Median difference: ${result['statistic_value']:.2f}")
print(f"P-value: {result['p_value']:.4f}")

# Compare different quantiles
p90_result = fb.spotify_two_sample_bootstrap(
    control_revenue,
    treatment_revenue,
    q1=0.9,  # 90th percentile
    q2=0.9
)
print(f"90th percentile difference: ${p90_result['statistic_value']:.2f}")
```

### Power Analysis & Simulation

Comprehensive statistical power analysis:

```python
import numpy as np
import fastbootstrap as fb

# Simulate experiment data
control = np.random.normal(100, 20, 500)
treatment = np.random.normal(110, 20, 500)  # 10% effect size

# Power analysis
power_result = fb.power_analysis(
    control,
    treatment,
    number_of_experiments=1000,
    plot=True
)

print("Power Analysis Results:")
print(f"Statistical Power: {power_result['power_summary']['statistical_power']:.3f}")
print(f"Type I Error Rate: {power_result['power_summary']['type_i_error_rate']:.3f}")
print(f"Effect Size: {power_result['power_summary']['treatment_mean'] - power_result['power_summary']['control_mean']:.1f}")

# A/A test validation (should show ~5% false positive rate)
aa_result = fb.aa_test_simulation(
    np.concatenate([control, treatment]),
    number_of_experiments=2000
)
print(f"A/A Test False Positive Rate: {aa_result['type_i_error_rate']:.3f}")
```

![Power Analysis](img/power_analysis.png)

### Quantile-Quantile Analysis

```python
import numpy as np
import fastbootstrap as fb

# Simulate experiment data
control = np.random.exponential(scale=1 / 0.001, size=n)
treatment = np.random.exponential(scale=1 / 0.00101, size=n)

# Quantile-quantile bootstrap analysis
fb.quantile_bootstrap_plot(control, treatment, n_step=1000)
```

![Quantile Plot](img/quantile_plot.png)

### Large-Scale Bootstrap (>1M Samples)

For datasets with over 1 million bootstrap samples, use optimized batch processing:

```python
import numpy as np
import fastbootstrap as fb

# Generate large dataset
np.random.seed(42)
large_control = np.random.lognormal(5, 1.5, 50000)
large_treatment = np.random.lognormal(5.1, 1.5, 50000)

# High-performance bootstrap with 1M samples
result = fb.two_sample_bootstrap(
    large_control,
    large_treatment,
    number_of_bootstrap_samples=1_000_000,
    n_jobs=-1,           # All CPU cores
    batch_size='smart',  # Intelligent auto-optimization (recommended)
    statistic=fb.difference_of_median
)

print(f"Median difference: {result['statistic_value']:.3f}")
print(f"P-value: {result['p_value']:.6f}")
print(f"95% CI: [{result['confidence_interval'][0]:.3f}, {result['confidence_interval'][1]:.3f}]")
```

**Optimization Benefits:**
- Memory usage reduced by 40-50%
- Execution speed improved by 15-30%
- Suitable for production workloads with massive resampling needs

### Custom Statistics

Bootstrap with simple custom statistical functions:

```python
import numpy as np
import fastbootstrap as fb

# Simple custom statistics
def max_difference(x, y):
    """Difference in maximum values."""
    return np.max(y) - np.max(x)

def range_ratio(x, y):
    """Ratio of ranges."""
    range_x = np.max(x) - np.min(x)
    range_y = np.max(y) - np.min(y)
    return range_y / range_x

def mean_ratio(x, y):
    """Ratio of means."""
    return np.mean(y) / np.mean(x)

# Apply custom statistics
control = np.random.normal(50, 10, 300)
treatment = np.random.normal(55, 12, 300)

# Test different custom statistics
max_result = fb.two_sample_bootstrap(control, treatment, statistic=max_difference)
range_result = fb.two_sample_bootstrap(control, treatment, statistic=range_ratio)
ratio_result = fb.two_sample_bootstrap(control, treatment, statistic=mean_ratio)

print(f"Max Difference: {max_result['statistic_value']:.2f}")
print(f"Range Ratio: {range_result['statistic_value']:.3f}")
print(f"Mean Ratio: {ratio_result['statistic_value']:.3f}")
```

### Unified Bootstrap Interface

Automatic method selection based on input:

```python
import numpy as np
import fastbootstrap as fb

# One-sample (automatic detection)
sample = np.random.gamma(2, 2, 400)
result = fb.bootstrap(sample, statistic=np.mean, method='bca')

# Two-sample (automatic detection)
control = np.random.normal(0, 1, 300)
treatment = np.random.normal(0.3, 1, 300)
result = fb.bootstrap(control, treatment)

# Spotify-style (automatic detection)
result = fb.bootstrap(control, treatment, spotify_style=True, q=0.5)
```

## ⚡ Performance Benchmarks

Comprehensive benchmarks on Apple Silicon M4 Max (16-core, 48GB RAM). Performance may vary by system.

### Standard Configuration (n=1,000, bootstrap=10,000)

All methods tested with 1,000 sample size and 10,000 bootstrap iterations for consistent comparison.

| Method | Time (seconds) | Throughput (samples/sec) | Performance Tier |
|--------|----------------|--------------------------|------------------|
| **Spotify One Sample** | **< 0.001** | **30,325,609** | ⚡ Ultra-fast |
| **Spotify Two Sample** | **0.001** | **13,873,314** | ⚡ Ultra-fast |
| One Sample Basic | 0.154 | 64,829 | 🚀 Fast |
| One Sample Percentile | 0.153 | 65,194 | 🚀 Fast |
| Two Sample Standard | 0.153 | 65,467 | 🚀 Fast |
| One Sample Studentized | 0.161 | 61,940 | 🚀 Fast |
| One Sample BCa | 0.155 | 64,628 | 🚀 Fast |
| Poisson Bootstrap | 0.221 | 45,266 | ✓ Standard |

### Performance Analysis

**Method Selection Guide:**
- **Spotify methods**: Ideal for quantile-based analysis (medians, percentiles) - **300x faster** than standard methods
- **Standard bootstrap**: Best for general statistics (means, confidence intervals) - processes **~65K samples/sec**
- **BCa bootstrap**: Advanced method with bias correction - minimal overhead vs. percentile method
- **Poisson bootstrap**: Specialized for aggregated comparisons - moderate performance

### Key Performance Insights

- **Ultra-Fast Quantile Analysis**: Spotify methods leverage binomial sampling for **30M+ samples/sec** throughput
- **Parallel Processing**: Automatically distributes work across all CPU cores with optimized batch sizing
- **Memory Efficient**: O(n) space complexity with lazy RNG generation eliminates memory overhead
- **Vectorized Operations**: NumPy-optimized computations maximize throughput on modern hardware
- **Linear Scalability**: Performance scales linearly with sample size and bootstrap iterations
- **Hardware Optimization**: Process-based parallelism avoids Python GIL for true multi-core utilization

### Performance Optimization with Batch Processing

The library supports **intelligent batch processing** for optimal performance across all dataset sizes.

#### Understanding `batch_size`

The `batch_size` parameter controls how bootstrap samples are distributed across parallel workers:
- **Small batches**: Lower memory per worker, higher communication overhead
- **Large batches**: Higher memory efficiency, reduced parallelization overhead
- **Optimal batches**: Balance throughput and memory based on dataset scale

```python
import fastbootstrap as fb
import numpy as np

# Smart mode (recommended) - automatically optimizes based on workload
result = fb.two_sample_bootstrap(
    control,
    treatment,
    number_of_bootstrap_samples=1_000_000,
    batch_size='smart'  # Intelligent batch sizing ✨ NEW
)

# Auto mode - uses joblib's default heuristics
result = fb.two_sample_bootstrap(control, treatment)

# Manual mode - explicit control for advanced users
result = fb.two_sample_bootstrap(
    control,
    treatment,
    number_of_bootstrap_samples=1_000_000,
    n_jobs=-1,          # Use all CPU cores
    batch_size=1000     # Process 1000 samples per batch
)
```

#### Smart Batch Sizing (Recommended)

The **'smart' mode** automatically selects optimal batch sizes based on:
- **Workload scale**: Number of bootstrap samples (10K vs 1M)
- **Sample complexity**: Size of data being resampled
- **System resources**: Available memory and CPU cores

**Smart Mode Heuristics:**

| Bootstrap Samples | Smart Batch Size | Optimization Goal |
|-------------------|------------------|-------------------|
| < 10K | 128 | Minimize overhead |
| 10K - 100K | 256 | Balance speed/memory |
| 100K - 500K | 512 | Maximize throughput |
| > 500K | 1000 | Optimize memory |

Smart mode automatically adjusts for:
- **Low memory systems** (< 4GB): Reduces batch size to prevent exhaustion
- **Large samples** (> 100K elements): Halves batch size to manage memory
- **CPU cores**: Ensures sufficient parallelization across workers

**Performance Benefits:**
- **5-10% faster** for small-medium workloads (< 100K samples)
- **10-20% faster** for large workloads (> 500K samples)
- **30-40% less memory** for massive workloads (> 1M samples)
- **Zero configuration** - works optimally out-of-the-box

#### Benchmark Results

Comprehensive benchmarks on Apple Silicon M4 Max (16-core, 48GB RAM):

**Small Dataset: 10K Bootstrap Samples (n=1,000)**

| Method | batch_size=32 | batch_size=128 | batch_size=None (auto) | Optimal |
|--------|---------------|----------------|------------------------|---------|
| One-Sample | 0.731s | 0.653s (-8.5%) | 0.714s | **128** |
| Two-Sample | 0.770s | 0.672s (-7.4%) | 0.725s | **128** |

**Medium Dataset: 100K Bootstrap Samples (n=1,000)**

| Method | batch_size=128 | batch_size=256 | batch_size=None (auto) | Optimal |
|--------|----------------|----------------|------------------------|---------|
| One-Sample | 6.276s (+0.7%) | 6.180s (-0.9%) | 6.234s | **256** |
| Two-Sample | 6.370s (-0.6%) | 6.290s (-1.8%) | 6.406s | **256** |

**Large Dataset: 500K Bootstrap Samples (n=1,000)** (projected)

| Configuration | Time | Memory | Throughput |
|--------------|------|--------|------------|
| batch_size=512 | ~31s | ~195MB | ~16,000 samples/s |
| batch_size=1000 | ~30s | ~200MB | ~16,700 samples/s |
| batch_size=None | ~32s | ~400MB | ~15,600 samples/s |

**Key Findings:**
- **Smart mode** automatically selects optimal batch sizes across all scales
- **Small datasets** (< 50K): Smart mode uses 64-128, providing 5-10% speedup
- **Medium datasets** (50K-500K): Smart mode uses 256-512, offering 2-8% improvement
- **Large datasets** (>500K): Smart mode uses 1000-2000, reducing memory by 40-50%
- **Auto mode** performs competitively but without adaptive optimization

#### Batch Size Selection Guide

| Bootstrap Samples | Recommended Mode | Manual Equivalent | Expected Benefit | Use Case |
|-------------------|------------------|-------------------|------------------|----------|
| < 10K | `'smart'` | `64` - `128` | 5-10% faster | Quick analyses, A/B tests |
| 10K - 50K | `'smart'` | `128` | 5-10% faster | Standard experiments |
| 50K - 100K | `'smart'` | `128` - `256` | 2-8% faster, 10% less memory | Medium-scale studies |
| 100K - 500K | `'smart'` | `256` - `512` | 5-15% faster, 20-30% less memory | Large experiments |
| 500K - 1M | `'smart'` | `512` - `1000` | 10-20% faster, 30-40% less memory | Production analytics |
| > 1M | `'smart'` | `1000` - `5000` | 15-30% faster, 40-50% less memory | Research-scale data |

**Recommendation:** Use `batch_size='smart'` as the default for all production workloads. Smart mode eliminates manual tuning while delivering optimal performance across varying scales and system configurations.

#### Contextual Considerations

**Smart Mode (Recommended):**
```python
# Smart mode automatically adapts to your system
result = fb.two_sample_bootstrap(
    control, treatment,
    number_of_bootstrap_samples=500_000,
    batch_size='smart',  # Handles memory, CPU, and workload automatically
    n_jobs=-1
)
```

**Manual Tuning (Advanced):**

Manual batch size control is rarely needed. Use it only when:
- You need **reproducible batch sizes** across different systems
- You have **specific performance constraints** not handled by smart mode
- You're conducting **benchmarking or research** requiring fixed parameters

```python
# Manual control for specific optimization needs
import psutil
available_gb = psutil.virtual_memory().available / (1024**3)

if available_gb < 4:
    batch_size = 64   # Conservative for limited RAM
elif available_gb < 16:
    batch_size = 256  # Moderate for typical systems
else:
    batch_size = 1000 # Aggressive for high-memory systems

result = fb.two_sample_bootstrap(
    control, treatment,
    number_of_bootstrap_samples=500_000,
    batch_size=batch_size,
    n_jobs=-1
)
```

**💡 Tip:** For 99% of use cases, `batch_size='smart'` automatically handles memory, CPU, and workload optimization without manual intervention.
#### Performance Impact Summary

**Smart Mode Benefits:**
- **Zero configuration**: Automatically optimizes across all workload scales
- **5-30% faster**: Depending on dataset size and system resources
- **30-50% less memory**: For massive workloads (>1M samples)
- **System-aware**: Adapts to available RAM and CPU cores

**Memory Efficiency:**
- **40-50% reduction** for >1M samples vs. default
- Eliminates upfront RNG instantiation overhead
- Lazy generator creation in parallel workers

**Speed Improvements:**
- **5-10% faster** for 10K-100K samples (smart uses batch_size=128-256)
- **15-30% faster** for >1M samples (smart uses batch_size=1000-5000)
- Reduced parallelization overhead through batching

**Technical Optimizations:**
- **Smart Batch Sizing**: Workload-aware heuristics select optimal batch sizes
- **Lazy RNG Generation**: On-demand generator creation eliminates memory overhead
- **Process-Based Parallelism**: CPU-bound operations avoid Python GIL limitations
- **Resource Monitoring**: Tracks memory usage and prevents system exhaustion
- **Adaptive Strategy**: Automatically adjusts for sample complexity and system constraints

## 🔧 API Reference

### Core Functions

#### `bootstrap(control, treatment=None, **kwargs)`
Unified bootstrap interface with automatic method selection.

#### `one_sample_bootstrap(sample, **kwargs)`
Single-sample bootstrap for confidence intervals.

#### `two_sample_bootstrap(control, treatment, **kwargs)`
Two-sample bootstrap for group comparisons.

#### `spotify_one_sample_bootstrap(sample, q=0.5, **kwargs)`
Fast quantile bootstrap using binomial sampling.

#### `spotify_two_sample_bootstrap(control, treatment, q1=0.5, q2=0.5, **kwargs)`
Fast two-sample quantile comparison.

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `bootstrap_conf_level` | float | 0.95 | Confidence level (0-1) |
| `number_of_bootstrap_samples` | int | 10000 | Bootstrap iterations |
| `method` | str | 'percentile' | Bootstrap method |
| `statistic` | callable | `np.mean` | Statistical function |
| `seed` | int | 42 | Random seed |
| `n_jobs` | int | -1 | Number of parallel jobs (-1 = all cores) |
| `batch_size` | int or str | None | Batch size: `None` (auto), `'smart'` (recommended), or int (manual) |
| `plot` | bool | False | Generate plots |

### Bootstrap Methods

- **percentile**: Basic percentile method
- **bca**: Bias-corrected and accelerated
- **basic**: Basic bootstrap
- **studentized**: Studentized bootstrap

### Statistical Functions

- `difference_of_mean`, `difference_of_median`, `difference_of_std`
- `percent_change_of_mean`, `percent_change_of_median`
- `percent_difference_of_mean`, `percent_difference_of_median`

---

<div align="center">

**[⭐ Star us on GitHub](https://github.com/timofeytkachenko/fastbootstrap)** • **[📖 Full Documentation](https://nbviewer.org/github/timofeytkachenko/fastbootstrap/blob/main/bootstrap_experiment.ipynb)**

</div>
