Metadata-Version: 2.4
Name: mostlyai
Version: 5.2.2
Summary: Synthetic Data SDK
Project-URL: homepage, https://app.mostly.ai/
Project-URL: repository, https://github.com/mostly-ai/mostlyai
Project-URL: documentation, https://mostly-ai.github.io/mostlyai/
Author-email: MOSTLY AI <dev@mostly.ai>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: <3.14,>=3.10
Requires-Dist: duckdb>=1.2.1
Requires-Dist: environs>=9.5.0
Requires-Dist: greenlet>=3.1.1
Requires-Dist: gunicorn>=23.0.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: ipywidgets>=8.1.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: psutil>=5.9.5
Requires-Dist: pyarrow>=16.0.0
Requires-Dist: pycryptodomex>=3.10.0
Requires-Dist: pydantic<3,>=2.4.2
Requires-Dist: requests>=2.31.0
Requires-Dist: rich>=13.7.0
Requires-Dist: schema>=0.7.5
Requires-Dist: semantic-version>=2.10.0
Requires-Dist: smart-open>=6.0.0
Requires-Dist: sqlparse>=0.5.3
Requires-Dist: typer>=0.9.0
Requires-Dist: xxhash>=3.2.0
Provides-Extra: databricks
Requires-Dist: databricks-sql-connector<4,>=3.2.0; extra == 'databricks'
Provides-Extra: googlebigquery
Requires-Dist: sqlalchemy-bigquery<2,>=1.6.1; extra == 'googlebigquery'
Provides-Extra: hive
Requires-Dist: impyla<0.20,>=0.19.0; extra == 'hive'
Requires-Dist: kerberos<2,>=1.3.1; extra == 'hive'
Requires-Dist: pyhive[hive-pure-sasl]<0.8,>=0.7.0; extra == 'hive'
Provides-Extra: local
Requires-Dist: adlfs>=2023.4.0; extra == 'local'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local'
Requires-Dist: filelock>=3.16.1; extra == 'local'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local'
Requires-Dist: joblib>=1.4.2; extra == 'local'
Requires-Dist: mostlyai-engine==1.5.4; extra == 'local'
Requires-Dist: mostlyai-qa==1.9.8; extra == 'local'
Requires-Dist: networkx<4,>=3.0; extra == 'local'
Requires-Dist: openpyxl>=3.1.5; extra == 'local'
Requires-Dist: python-multipart>=0.0.20; extra == 'local'
Requires-Dist: s3fs>=2023.1.0; extra == 'local'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'local'
Requires-Dist: sshtunnel<0.5,>=0.4.0; extra == 'local'
Requires-Dist: torch<2.7.1,>=2.7.0; extra == 'local'
Requires-Dist: torchaudio<2.7.1,>=2.7.0; extra == 'local'
Requires-Dist: torchvision<0.22.1,>=0.22.0; extra == 'local'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local'
Requires-Dist: xlsxwriter<4,>=3.1.9; extra == 'local'
Provides-Extra: local-gpu
Requires-Dist: adlfs>=2023.4.0; extra == 'local-gpu'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local-gpu'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local-gpu'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local-gpu'
Requires-Dist: filelock>=3.16.1; extra == 'local-gpu'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local-gpu'
Requires-Dist: joblib>=1.4.2; extra == 'local-gpu'
Requires-Dist: mostlyai-engine[gpu]==1.5.4; extra == 'local-gpu'
Requires-Dist: mostlyai-qa==1.9.8; extra == 'local-gpu'
Requires-Dist: networkx<4,>=3.0; extra == 'local-gpu'
Requires-Dist: openpyxl>=3.1.5; extra == 'local-gpu'
Requires-Dist: python-multipart>=0.0.20; extra == 'local-gpu'
Requires-Dist: s3fs>=2023.1.0; extra == 'local-gpu'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local-gpu'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'local-gpu'
Requires-Dist: sshtunnel<0.5,>=0.4.0; extra == 'local-gpu'
Requires-Dist: torch<2.7.1,>=2.7.0; extra == 'local-gpu'
Requires-Dist: torchaudio<2.7.1,>=2.7.0; extra == 'local-gpu'
Requires-Dist: torchvision<0.22.1,>=0.22.0; extra == 'local-gpu'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local-gpu'
Requires-Dist: xlsxwriter<4,>=3.1.9; extra == 'local-gpu'
Provides-Extra: mssql
Requires-Dist: pyodbc<6,>=5.1.0; extra == 'mssql'
Provides-Extra: mysql
Requires-Dist: mysql-connector-python<10,>=9.1.0; extra == 'mysql'
Provides-Extra: oracle
Requires-Dist: oracledb<3,>=2.2.1; extra == 'oracle'
Provides-Extra: postgres
Requires-Dist: psycopg2<3,>=2.9.4; extra == 'postgres'
Provides-Extra: snowflake
Requires-Dist: snowflake-sqlalchemy<2,>=1.6.1; extra == 'snowflake'
Description-Content-Type: text/markdown

# Synthetic Data SDK ✨

[![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai)](https://github.com/mostly-ai/mostlyai/releases)
[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai/)
[![PyPI Downloads](https://static.pepy.tech/badge/mostlyai)](https://pepy.tech/projects/mostlyai)
[![License](https://img.shields.io/github/license/mostly-ai/mostlyai)](https://github.com/mostly-ai/mostlyai/blob/main/LICENSE)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai)](https://pypi.org/project/mostlyai/)
[![GitHub stars](https://img.shields.io/github/stars/mostly-ai/mostlyai?style=social)](https://github.com/mostly-ai/mostlyai/stargazers)

[Documentation](https://mostly-ai.github.io/mostlyai/) | [Technical White Paper](https://arxiv.org/abs/2508.00718) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/) | [Free Cloud Service](https://app.mostly.ai/)

The **Synthetic Data SDK** is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **LOCAL** mode trains and generates synthetic data locally on your own compute resources.
- **CLIENT** mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
- Generators that were trained locally can easily be imported to a platform for further sharing.

## Overview

The SDK allows you to programmatically create, browse and manage 3 key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples, tailored to your needs
3. **Connectors** - Connect to any data source within your organization, for reading and writing data

| Intent                                        | Primitive                         | API Reference                                                                                                 |
|-----------------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------------------------|
| Train a Generator on tabular or language data | `g = mostly.train(config)`        | [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
| Connect to any data source within your org    | `c = mostly.connect(config)`      | [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |

https://github.com/user-attachments/assets/9e233213-a259-455c-b8ed-d1f1548b492f

## Key Features

- **Broad Data Support**
  - Mixed-type data (categorical, numerical, geospatial, text, etc.)
  - Single-table, multi-table, and time-series
- **Multiple Model Types**
  - State-of-the-art performance via TabularARGN
  - Fine-tune Hugging Face hosted language models
  - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
  - GPU/CPU support
  - Differential Privacy
  - Progress Monitoring
- **Automated Quality Assurance**
  - Quality metrics for fidelity and privacy
  - In-depth HTML reports for visual analysis
- **Flexible Sampling**
  - Up-sample to any data volume
  - Conditional simulations based on any columns
  - Re-balance underrepresented segments
  - Context-aware data imputation
  - Statistical fairness controls
  - Rule-adherence via temperature
- **Seamless Integration**
  - Connect to external data sources (DBs, cloud storages)
  - Fully permissive open-source license

## Quick Start <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/getting-started/getting-started.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

Install the SDK via `pip` (see [Installation](#installation) for further details):

```shell
pip install -U mostlyai  # or 'mostlyai[local]' for LOCAL mode
```

Generate synthetic samples using a pre-trained generator:

```python
# initialize the SDK
from mostlyai.sdk import MostlyAI
mostly = MostlyAI()

# import a trained generator
g = mostly.generators.import_from_file(
  "https://github.com/mostly-ai/public-demo-data/raw/dev/census/census-generator.zip"
)

# probe for 1000 representative synthetic samples
df = mostly.probe(g, size=1000)
df
```

Generate synthetic samples based on fixed column values:

```python
# create 10k records of 24y male respondents
df = mostly.probe(g, seed=[{"age": 24, "sex": "Male"}] * 10_000)
df
```
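
The seed in the example above is a plain list with one dict per requested record, so it can be assembled with ordinary Python before it is passed to `mostly.probe`. The following is a minimal sketch that builds a seed with two equally sized segments. The column names `age` and `sex` come from the census example, while the 50/50 split and the segment values are illustrative assumptions.

```python
# Two segments instead of a single fixed profile (illustrative values).
segments = [
    {"age": 24, "sex": "Male"},
    {"age": 24, "sex": "Female"},
]
rows_per_segment = 5_000

# One dict per requested record; each dict is copied so that rows
# can later be edited independently of each other.
seed = [dict(segment) for segment in segments for _ in range(rows_per_segment)]

print(len(seed))  # 10000

# With the SDK installed and `g` imported as shown above:
# df = mostly.probe(g, seed=seed)
```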

And now train your very own synthetic data generator:

```python
# load original data
import pandas as pd
original_df = pd.read_csv(
  "https://github.com/mostly-ai/public-demo-data/raw/dev/titanic/titanic.csv"
)

# train a single-table generator, with default configs
g = mostly.train(
  name="Quick Start Demo - Titanic",
  data=original_df,
)

# display the quality assurance report
g.reports(display=True)

# generate a representative synthetic dataset, with default configs
sd = mostly.generate(g)
df = sd.data()

# or simply probe for some samples
df = mostly.probe(g, size=100)
df
```

## Performance

The SDK is being developed with a focus on efficiency, accuracy, and flexibility, with best-in-class performance across all three. Results ultimately depend on the training data itself (size, structure, and content), on the available compute (CPU vs GPU), and on the chosen training configuration (model, epochs, samples, etc.). A crawl / walk / run approach is therefore recommended: start with a subset of samples and a limited training time, then gradually scale up to yield optimal results for the use case at hand.
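
One way to organize a crawl / walk / run progression is to prepare a list of training configs that grow in sample size and training time. The sketch below only builds plain dicts and does not call the SDK. The helper `staged_configs` is hypothetical, the numbers are illustrative, and the keys `tabular_model_configuration`, `max_sample_size`, and `max_training_time` are assumptions that should be checked against the API reference before use.

```python
def staged_configs(name, table_name, stages):
    """Return one training config dict per (max_sample_size, max_training_time) stage.

    Hypothetical helper; the config keys are assumptions, not confirmed API.
    """
    configs = []
    for i, (max_sample_size, max_training_time) in enumerate(stages, start=1):
        configs.append({
            "name": f"{name} - stage {i}",
            "tables": [{
                "name": table_name,
                # "data": original_df,  # attach the training DataFrame here
                "tabular_model_configuration": {
                    "max_sample_size": max_sample_size,
                    "max_training_time": max_training_time,
                },
            }],
        })
    return configs


# crawl / walk / run: more samples and more time at every stage
stages = [(1_000, 1), (10_000, 5), (100_000, 30)]
configs = staged_configs("Titanic", "titanic", stages)
for config in configs:
    print(config["name"], config["tables"][0]["tabular_model_configuration"])
```

Each dict could then be completed with the training data and passed to `mostly.train(config=...)`, comparing the quality reports between stages before committing to a longer run.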

### Tabular Models

Tabular models within the SDK are built on TabularARGN ([arXiv:2501.12012](https://arxiv.org/abs/2501.12012)), which achieves best-in-class synthetic data quality while being 1–2 orders of magnitude more efficient than comparable models. This efficiency enables the training and generation of millions of synthetic records within minutes, even on CPU environments.

![TabularARGN Benchmark](https://raw.githubusercontent.com/mostly-ai/mostlyai/refs/heads/main/docs/TabularARGN-benchmark.png)

### Language Models

The default language model is a basic, non-pre-trained LSTM (`LSTMFromScratch-3m`), particularly effective for textual data with limited scope (short lengths, narrow variety) and sufficient training samples.

Alternatively, any pre-trained language model that is available via the [Hugging Face Hub](https://huggingface.co/) and supports the [AutoModelForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM) class can be selected and fine-tuned on the provided training data. These models start out with general world knowledge and then adapt to the training data, generating high-fidelity synthetic samples even in sparse data domains. The final performance will, once again, largely depend on the chosen model configuration.

In either case, a modern GPU is highly recommended when working with language models.

## Installation

Use `pip` (or better, `uv pip`) to install the official `mostlyai` package from PyPI. Python 3.10 through 3.13 is required.
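
The package metadata declares `Requires-Python: <3.14,>=3.10`. Before installing, the running interpreter can be checked against that range with a few lines of standard-library Python. This is a minimal sketch; the function name `is_supported` is made up for illustration.

```python
import sys

MIN_VERSION = (3, 10)  # inclusive, from Requires-Python: >=3.10
MAX_VERSION = (3, 14)  # exclusive, from Requires-Python: <3.14


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair lies within the declared range."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION


print(is_supported())
```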

It is highly recommended to install the package within a dedicated virtual environment using `uv` (see [here](https://docs.astral.sh/uv/)):

<details>

  <summary>Setup of <code>uv</code> on Unix / macOS</summary>

```shell
# Install uv if you don't have it yet
curl -Ls https://astral.sh/uv/install.sh | bash

# Create and activate a Python 3.12 environment with uv
mkdir ~/synthetic-data-sdk; cd ~/synthetic-data-sdk
uv venv -p 3.12

# Activate virtual environment
source .venv/bin/activate
```

</details>

<details>

  <summary>Setup of <code>uv</code> on Windows</summary>

```powershell
# Install uv if you don't have it yet
irm https://astral.sh/uv/install.ps1 | iex

# Create and activate a Python 3.12 environment with uv
mkdir ~/synthetic-data-sdk; cd ~/synthetic-data-sdk
uv venv -p 3.12

# Activate virtual environment
.venv\Scripts\activate
```

</details>

<details>

  <summary>Run Jupyter Lab session via <code>uv</code></summary>

```shell
# Optionally launch jupyter session after SDK installation
uv run --with jupyter jupyter lab
```

</details>

### CLIENT mode

This is a lightweight installation for using the SDK in CLIENT mode only. It communicates with a MOSTLY AI platform to perform the requested tasks. See e.g. [app.mostly.ai](https://app.mostly.ai/) for a free-to-use hosted version.

```shell
uv pip install -U mostlyai
```

### CLIENT + LOCAL mode

This is a full installation for using the SDK in both CLIENT and LOCAL mode. It includes all dependencies, incl. PyTorch, for training and generating synthetic data locally.

```shell
uv pip install -U 'mostlyai[local]'
```

Alternatively, for a GPU setup on Linux (needed for LLM fine-tuning and inference):

```shell
uv pip install -U 'mostlyai[local-gpu]'
```

On Linux, one can explicitly install the CPU-only variant of torch together with `mostlyai[local]`:

```shell
# uv pip install
uv pip install --index-strategy unsafe-first-match -U torch==2.7.0+cpu torchvision==0.22.0+cpu 'mostlyai[local]' --extra-index-url https://download.pytorch.org/whl/cpu
```

```shell
# standard pip install
pip install -U torch==2.7.0+cpu torchvision==0.22.0+cpu 'mostlyai[local]' --extra-index-url https://download.pytorch.org/whl/cpu
```
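
After installing one of the local extras, a quick way to see whether the heavier dependencies are visible to the current interpreter is to look them up without importing them. The sketch below uses only the standard library. The helper `missing_modules` is hypothetical, and the module list is an illustrative selection of packages pulled in by the `local` extra, not a complete check.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level modules from `module_names` that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


# A few top-level modules installed by the `local` extra (illustrative selection).
LOCAL_MODE_MODULES = ["torch", "fastapi", "sqlalchemy", "joblib"]

missing = missing_modules(LOCAL_MODE_MODULES)
if missing:
    print("LOCAL mode dependencies not found:", ", ".join(missing))
else:
    print("LOCAL mode dependencies look complete")
```

If modules are reported as missing right after a successful installation in a notebook, restarting the runtime is the first thing to try.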


> **Note for Google Colab users**: After installing either of the local extras (`mostlyai[local]` or `mostlyai[local-gpu]`), you may need to restart the runtime for the changes to take effect.

### Data Connectors

Add any of the following extras for additional data connector support in LOCAL mode: `databricks`, `googlebigquery`, `hive`, `mssql`, `mysql`, `oracle`, `postgres`, `snowflake`. For example:

```shell
uv pip install -U 'mostlyai[local, databricks, snowflake]'
```
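
When the set of connectors varies between environments, the install target can be assembled programmatically. The following is a small sketch under the assumption that the extras listed in this package's metadata are the only valid ones. The helper `install_target` is hypothetical and only builds the requirement string; it does not run `pip`.

```python
# Extras declared via Provides-Extra in the package metadata.
KNOWN_EXTRAS = {
    "local", "local-gpu", "databricks", "googlebigquery", "hive",
    "mssql", "mysql", "oracle", "postgres", "snowflake",
}


def install_target(extras=()):
    """Build a requirement string such as 'mostlyai[local,snowflake]'."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    if not extras:
        return "mostlyai"
    # dict.fromkeys drops duplicates while keeping the given order
    return f"mostlyai[{','.join(dict.fromkeys(extras))}]"


print(install_target(["local", "databricks", "snowflake"]))
# mostlyai[local,databricks,snowflake]
```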

### Using Docker

As an alternative, you can also build a Docker image, which provides you with an isolated environment for running the SDK in LOCAL mode, with all connector dependencies pre-installed. This approach ensures a consistent runtime environment across all systems. Before proceeding, make sure [Docker](https://docs.docker.com/get-started/get-docker/) is installed on your system.

<details>

  <summary>Get the image</summary>

  - **Pull from official repository**

    `docker pull --platform=linux/amd64 ghcr.io/mostly-ai/sdk`

  - **(Optional) Build your own image**

    If `make` is available in your environment (see the Makefile targets [here](https://github.com/mostly-ai/mostlyai/blob/main/Makefile#L47-L73)), execute `make docker-build`.

    Otherwise, use `docker buildx build . --platform=linux/amd64 -t ghcr.io/mostly-ai/sdk` instead.

</details>

<details>

  <summary>Start the container</summary>

  This will launch the SDK in LOCAL mode on port 8080 inside the container.

  If `make` is available in your environment, execute `make docker-run`, or `make docker-run HOST_PORT=8080` to forward to a host port of your choice. You can also mount the `local_dir` via `make docker-run HOST_LOCAL_DIR=/path/to/host/folder` to make the generators and synthetic datasets directly accessible from the host.

  Otherwise, use `docker run --platform=linux/amd64 -p 8080:8080 ghcr.io/mostly-ai/sdk` instead. Optionally, you can use the `-v` flag to mount a [volume](https://docs.docker.com/engine/storage/volumes/#syntax) for passing files between the host and the container.

</details>

<details>

  <summary>Connect to the container</summary>

  You can now connect to the SDK running within the container by initializing the SDK in `CLIENT` mode on the host machine.

  ```python
  from mostlyai.sdk import MostlyAI

  mostly = MostlyAI(base_url="http://localhost:8080")
  ```

</details>

### Air-gapped Environments

For air-gapped environments (without internet access), you must install the package using the provided wheel files, including any optional dependencies you require.

If your application depends on a Hugging Face language model, you’ll also need to manually download and transfer the model files.

<details>

  <summary>Download models from Hugging Face Hub</summary>

<p>On a machine with internet access, run the following Python script to download the Hugging Face model to your local Hugging Face cache.</p>

```python
#! uv pip install huggingface-hub
from pathlib import Path
from huggingface_hub import snapshot_download
path = snapshot_download(
    repo_id="Qwen/Qwen2.5-Coder-0.5B",  # change accordingly
    token=None,  # insert your HF TOKEN for gated models
)
print(f"COPY `{Path(path).parent.parent}`")
```

Next, transfer the printed directory to the air-gapped environment's cache directory located at `~/.cache/huggingface/hub/` (or at `$HF_HOME/hub/`, if the `HF_HOME` environment variable has been set).
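
To find the target directory on the air-gapped machine, the lookup can be sketched in a few lines of standard-library Python. This is a simplified approximation of how the cache location is resolved (an explicit `HF_HUB_CACHE` first, then `HF_HOME`, then the default); it ignores other variables such as `XDG_CACHE_HOME`, and the function name `hf_hub_cache_dir` is made up for illustration.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=os.environ):
    """Resolve the Hugging Face hub cache directory from environment variables."""
    # An explicit hub cache location takes precedence.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the hub cache lives in the `hub` folder below HF_HOME.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(hf_hub_cache_dir())
```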

</details>


## Citation

Please consider citing our project if you find it useful:

```bibtex
@misc{mostlyai,
      title={Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK},
      author={Ivona Krchova and Mariana Vargas Vieyra and Mario Scriminaci and Andrey Sidorenko},
      year={2025},
      eprint={2508.00718},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.00718},
}
```
