# 1. Abstract

In data science, the goal is often to explain a target variable `Y` in terms of an input `X` with a model of the form:

- `Y = f_alpha(X) + epsilon`

  where:

  - `epsilon ~ N(0,1)` is Gaussian noise,
  - `X` is an `n x p` data matrix,
  - `Y` is an `n x 1` vector,
  - and `alpha` represents the model parameters.

The function `f` can be linear or nonlinear, for instance implemented as a neural network. In this case, `alpha` corresponds to the set of weights and biases of the network.

Sylvain Sardy (ref) proposed a relaxation technique to estimate the parameters `alpha`:

- `alpha_hat = argmin_alpha ( ||Y - f_alpha(X)||_2 + lambda * ||alpha||_1 )`

One of the main challenges is choosing the regularization parameter `lambda`:

- If `lambda` is too large, the resulting model will be overly sparse and inaccurate.
- If `lambda` is too small, the model will be accurate but not sparse enough.

The goal is to find the best trade-off:

- The package implements Sardy’s algorithm to compute the optimal `lambda`.
- The implementation automatically runs on one or multiple GPUs if available, or on the CPU otherwise.
- An auto-detection feature ensures the best use of the available hardware.

The optimal value, referred to as the **“lambda happy”**, is computed as:

- `lambda_happy = quantile_0.95( ||X^T * Z_centered||_∞ / ||Z_centered||_2 )`

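- where, for each of the `m` columns of `Z_centered`, the numerator uses the Chebyshev (L∞) norm and the denominator the Euclidean (L2) norm; the quantile is then taken over the resulting `m` ratios.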

- Here, `Z_centered` is an `n x m` random matrix (with `m` typically large enough for accurate quantile estimation).

- Each column of `Z` is drawn independently from `N(0,1)` and then centered (its mean is subtracted so that every column has zero mean).
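
The formula above can be transcribed almost line by line into plain PyTorch. The sketch below is only an illustration of the definition: the function name `lambda_happy_reference` and its `seed` argument are hypothetical and are not part of the package, and the code makes no attempt to match the package's optimized kernels.

```python
import torch


def lambda_happy_reference(X: torch.Tensor, m: int = 10_000, q: float = 0.95, seed: int = 0) -> float:
    """Plain PyTorch transcription of the lambda_happy formula (illustrative only)."""
    n, _ = X.shape
    gen = torch.Generator(device="cpu").manual_seed(seed)

    # Z: n x m matrix with i.i.d. N(0, 1) entries, one column per Monte Carlo draw.
    Z = torch.randn(n, m, generator=gen).to(device=X.device, dtype=X.dtype)

    # Center each column so that it has zero mean.
    Z_centered = Z - Z.mean(dim=0, keepdim=True)

    # Numerator: L-infinity norm of each column of X^T Z (p x m matrix -> m values).
    numerator = (X.T @ Z_centered).abs().amax(dim=0)

    # Denominator: L2 norm of each centered column (m values).
    denominator = torch.linalg.vector_norm(Z_centered, ord=2, dim=0)

    # Empirical q-quantile of the m ratios (computed in float32).
    return torch.quantile((numerator / denominator).float(), q).item()


X = torch.randn(200, 50)
print(f"lambda_happy (reference sketch): {lambda_happy_reference(X, m=2_000):.4f}")
```

Because every ratio is linear in `X`, multiplying `X` by a positive constant multiplies the result by the same constant, which is a convenient sanity check for any implementation.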

A pure PyTorch version without custom CUDA code has also been developed for wider use, with support for macOS and Windows. See: https://pypi.org/project/torch-lambda-happy/

> ℹ️ This package is currently compatible only with Python 3.10 on Linux x86_64.

# 2. Installation

Here is how to install the package with its dependencies. The torch library must be installed separately depending on the operating system.

## 2.1 Install the `lambda-happy` library

Only the lambda-happy package below is required.
The others are optional and can be used for benchmarking or validation.
In all cases, however, you must install the PyTorch dependency described below.

```bash
# Core functionality
pip install lambda-happy

# Benchmark GUI (PyQt5)
# Quotes keep shells such as zsh from interpreting the square brackets.
pip install "lambda-happy[benchmark]"

# Validation tools (PyQt5 + pandas)
pip install "lambda-happy[validation]"

# All extras (Benchmark + Validation tools)
pip install "lambda-happy[all]"
```

## 2.2 Dependencies

The backend relies on PyTorch and CUDA, depending on your hardware.
You must therefore install the corresponding PyTorch version.

- Linux (successfully tested)

```bash
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118
```

> ℹ️ This project was developed on Ubuntu 22.04 with CUDA 11.8 (2025).

# 3. Examples and recommendations

## 3.1 Recommended use case

Here is an example that follows best practices. It is always preferable to create the matrix directly on the target device to avoid an unnecessary transfer or conversion.

```py
import torch
from lambda_happy import LambdaHappy

# Prepare data
X = torch.randn(1000, 5000, device="cuda")

# Initialize solver (auto-select the fastest backend)
solver = LambdaHappy(X, force_fastest=True)

# Single estimate
lambda_value = solver.compute(m=10000)
print(f"lambda_value: {lambda_value:.4f}")

# Multiple runs
lambda_values = solver.compute_many(m=10000, nb_run=50)
print(f"lambda_values: {lambda_values}")

# Aggregated (median)
lambda_median = solver.compute_agg(m=10000, nb_run=500, func=torch.median)
print(f"lambda_median: {lambda_median:.4f}")
```
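
The example above assumes that a CUDA device is present. On machines where this is not guaranteed, the device can be chosen at run time before `X` is created. The snippet below is a minimal sketch that uses only standard PyTorch calls; the variable names are illustrative.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Preferred: allocate X directly on the target device.
X = torch.randn(1_000, 5_000, device=device)

# Also valid, but on a GPU machine this allocates on the CPU first and then copies.
X_copied = torch.randn(1_000, 5_000).to(device)

print(f"X lives on: {X.device}")
```

The resulting `X` can then be passed to `LambdaHappy` exactly as in the example above.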

## 3.2 Example with all parameters (single estimation)

```py
import torch
from lambda_happy import LambdaHappy

matX = torch.randn(1_000, 1_000)
model = LambdaHappy(X=matX, force_fastest=False, use_multigpu=False)
lambda_value = model.compute(m=10_000, version="AUTOMATIC", dtype=torch.float16, device_type="cuda")
print(f"Estimated lambda: {lambda_value:.4f}")

```

## 3.3 Example with all parameters (many estimations)

```py
import torch
from lambda_happy import LambdaHappy

matX = torch.randn(1_000, 1_000)
model = LambdaHappy(X=matX, force_fastest=True, use_multigpu=False)
lambda_values = model.compute_many(m=10_000, version="AUTOMATIC", dtype=torch.float32, device_type="cuda", nb_run=100)
print(f"Estimated lambdas: {lambda_values}")
```

## 3.4 Example with all parameters (aggregated estimation)

```py

import torch
from lambda_happy import LambdaHappy

matX = torch.randn(1_000, 1_000)
model = LambdaHappy(X=matX, force_fastest=True, use_multigpu=True)
lambda_mean = model.compute_agg(
    m=10_000, version="AUTOMATIC", dtype=torch.float32, device_type="cpu", nb_run=10, func=torch.mean
)
print(f"Estimated lambda: {lambda_mean:.4f}")

```

> ⚠️ The examples above illustrate different ways of using the library, but they are not necessarily the fastest methods.  
> For the most efficient versions, please refer to the `3.1 Recommended use case` section.

> ℹ️ Use `float16` (or `force_fastest=True`) on **GPU** only if the input matrix **X** is normalized.
> Setting `use_multigpu=True` will utilize all available GPUs if more than one is present.
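
One plausible reason for the `float16` restriction is the limited range of half precision. The sketch below assumes that "normalized" means each column of `X` is standardized to zero mean and unit variance; the README does not define the term, so treat this as one possible preprocessing step rather than the package's requirement. The input data is synthetic.

```python
import torch

# Hypothetical un-normalized input: feature values far outside float16's range.
X_raw = torch.randn(1_000, 100) * 1e5

# float16 cannot represent magnitudes above 65504, so larger entries become inf.
fp16_max = torch.finfo(torch.float16).max
has_overflow = torch.isinf(X_raw.half().float()).any().item()

# Assumed meaning of "normalized": zero mean and unit variance per column.
X_norm = (X_raw - X_raw.mean(dim=0, keepdim=True)) / X_raw.std(dim=0, keepdim=True)
is_safe = torch.isfinite(X_norm.half().float()).all().item()

print(f"float16 max: {fp16_max}")
print(f"raw X overflows in float16: {has_overflow}")
print(f"standardized X stays finite in float16: {is_safe}")
```

Note that standardizing `X` changes its scale, and the estimated lambda scales with it, so the same preprocessing should be applied consistently wherever `X` is used.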

## 3.5 Recommended Settings

| Context    | Data Type     | Notes                                                                  |
| ---------- | ------------- | ---------------------------------------------------------------------- |
| CPU        | `float32`     | Stable, widely supported, and generally the fastest on CPU.            |
| GPU (CUDA) | `float16`     | High performance if `X` is normalized; otherwise use `float32`.        |
| Backend    | `"AUTOMATIC"` | Selects the best available implementation based on hardware and dtype. |

# 4. Performance Trade-Offs

## 4.1 Projection Dimension (m)

- ↑ **m** → improves lambda_happy precision.
- ↑ **m** → linearly increases compute time (all kernels scale with m).
- Recommended: **m = 10_000** provides good accuracy in most cases.
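
The precision gain from a larger `m` follows the usual Monte Carlo behavior: the spread of an empirical quantile shrinks roughly like `1/sqrt(m)`. The standard-library sketch below illustrates this on standard normal samples, used here as a stand-in for the actual lambda ratios; the helper names are hypothetical.

```python
import random
import statistics


def empirical_quantile(samples, q=0.95):
    """Simple order-statistic estimate of the q-quantile."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def quantile_spread(m, repeats=100, seed=0):
    """Standard deviation of the 0.95-quantile estimate over repeated draws of size m."""
    rng = random.Random(seed)
    estimates = [
        empirical_quantile([rng.gauss(0.0, 1.0) for _ in range(m)])
        for _ in range(repeats)
    ]
    return statistics.stdev(estimates)


for m in (100, 1_000, 10_000):
    print(f"m = {m:>6}: spread of the 0.95 quantile ~ {quantile_spread(m):.4f}")
```

Each tenfold increase in `m` should reduce the spread by roughly a factor of three, which is why gains become modest beyond a few thousand draws.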

> ℹ️ Use `float16` on **GPU** only if the input matrix **X** is normalized.
> Otherwise, lambda_happy estimation may be unstable or inconsistent.

## 4.2 Sample Dimension (n)

- ↑ **n** → increases cost in all kernels (since Z ∈ R^(n × m)), except for the quantile post-processing step.

## 4.3 Feature Dimension (p)

- ↑ **p** → only affects the **X^T·Z** matrix multiplication.
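
The scaling rules in sections 4.1 to 4.3 can be summarized with a rough work model. The function below is a hypothetical back-of-the-envelope sketch, not a measurement of the package's kernels: it counts only the dominant terms and ignores constants, memory traffic, and hardware effects.

```python
def rough_work(n: int, p: int, m: int) -> dict:
    """Dominant work terms for one lambda_happy estimate (illustrative model)."""
    return {
        # Generating, centering, and taking column norms of Z touch all n * m entries.
        "elementwise_on_Z": n * m,
        # X^T (p x n) times Z (n x m): one multiply and one add per (p, n, m) triple.
        "matmul_Xt_Z": 2 * n * p * m,
        # The quantile step only sees the m ratios, independent of n and p.
        "quantile_inputs": m,
    }


base = rough_work(n=1_000, p=1_000, m=10_000)
double_p = rough_work(n=1_000, p=2_000, m=10_000)

for name in base:
    print(f"{name:<18} base={base[name]:.2e}  p doubled={double_p[name]:.2e}")
```

In this model, doubling `p` doubles only the matrix multiplication term, doubling `n` affects everything except the quantile step, and doubling `m` scales every term.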

# 5. Benchmark

The `lambda-happy-benchmark` script measures and compares the performance of LambdaHappy on CPU and GPU.
It offers various benchmarking options and displays live throughput plots.
Example usage:

```sh
lambda-happy-benchmark --benchmark_2D --benchmark_3D --benchmark_float --device cuda --dtype float32 -n 1000 -p 1000 -m 10000
```

This runs a 2D benchmark on CUDA with the specified matrix dimensions, then runs a 3D benchmark.

> ℹ️ Note: Not all hyperparameters are used for every plot, but if provided, they will be applied when relevant.

# 6. Validation

The `lambda-happy-validation` script runs tests to validate lambda_happy estimation accuracy.
It generates detailed reports and distribution plots using pandas and PyQt5.

Example usage:

```sh
lambda-happy-validation --distribution_small --distribution_large --device cuda --dtype float32 -n 1000 -p 1000
```

This plots small-scale and large-scale lambda_happy distributions on CUDA for the given parameters.

# 7. Performance comparison

Here are the results for a CUDA computation, sorted from slowest to fastest. The speed-up is relative to the first row:

| Rank | Mode      | Version       | Precision | FPS  | Speed-up |
| ---- | --------- | ------------- | --------- | ---- | -------- |
| 1    | Mono-GPU  | SMART_TENSOR  | Float32   | 449  | 1.00x    |
| 2    | Mono-GPU  | GPU_DEDICATED | Float32   | 501  | 1.12x    |
| 3    | Multi-GPU | SMART_TENSOR  | Float32   | 511  | 1.14x    |
| 4    | Multi-GPU | SMART_TENSOR  | Float16   | 664  | 1.48x    |
| 5    | Multi-GPU | GPU_DEDICATED | Float32   | 911  | 2.03x    |
| 6    | Mono-GPU  | SMART_TENSOR  | Float16   | 1215 | 2.71x    |
| 7    | Mono-GPU  | GPU_DEDICATED | Float16   | 1618 | 3.60x    |
| 8    | Multi-GPU | GPU_DEDICATED | Float16   | 2104 | 4.69x    |

> ℹ️ FPS: number of times the lambda_happy value is estimated per second.
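
The speed-up column in the table above is the FPS of each configuration divided by the FPS of the slowest one. The short script below reproduces that arithmetic from the published FPS values and also converts each rate into an average time per estimate; it uses only numbers from the table.

```python
# FPS values copied from the table above: (mode, version, precision) -> FPS.
fps = {
    ("Mono-GPU", "SMART_TENSOR", "Float32"): 449,
    ("Mono-GPU", "GPU_DEDICATED", "Float32"): 501,
    ("Multi-GPU", "SMART_TENSOR", "Float32"): 511,
    ("Multi-GPU", "SMART_TENSOR", "Float16"): 664,
    ("Multi-GPU", "GPU_DEDICATED", "Float32"): 911,
    ("Mono-GPU", "SMART_TENSOR", "Float16"): 1215,
    ("Mono-GPU", "GPU_DEDICATED", "Float16"): 1618,
    ("Multi-GPU", "GPU_DEDICATED", "Float16"): 2104,
}

# The slowest configuration serves as the 1.00x baseline.
baseline = fps[("Mono-GPU", "SMART_TENSOR", "Float32")]
speed_ups = {config: value / baseline for config, value in fps.items()}

for (mode, version, precision), ratio in sorted(speed_ups.items(), key=lambda kv: kv[1]):
    ms_per_estimate = 1000 / fps[(mode, version, precision)]
    print(f"{mode:<9} {version:<13} {precision:<7} {ratio:.2f}x  {ms_per_estimate:.2f} ms/estimate")
```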

The test server is equipped with an Intel Xeon E5-2699 v3 processor (2014) and three NVIDIA GeForce RTX 2080 Ti graphics cards (2018).

The evaluation uses the default parameters, with X of size 1000x1000 and m=10000.

> ℹ️ Note: Use device="cuda" when you create X.

# 8. About This Project

This package, including performance optimizations, was developed as part of a Bachelor’s thesis at HE-Arc by Sevan Yerly (sevan.yerly@he-arc.ch), under the supervision of Cédric Bilat (cedric.bilat@he-arc.ch). The mathematical foundations were developed by Sylvain Sardy (sylvain.sardy@unige.ch).

For questions, contact sevan.yerly@he-arc.ch or cedric.bilat@he-arc.ch.
