Metadata-Version: 2.4
Name: mostlyai
Version: 4.2.2
Summary: Synthetic Data SDK
Project-URL: homepage, https://app.mostly.ai/
Project-URL: repository, https://github.com/mostly-ai/mostlyai
Project-URL: documentation, https://mostly-ai.github.io/mostlyai/
Author-email: MOSTLY AI <dev@mostly.ai>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: <3.14,>=3.10
Requires-Dist: environs>=9.5.0
Requires-Dist: filelock>=3.16.1
Requires-Dist: greenlet<4,>=3.1.1
Requires-Dist: gunicorn>=23.0.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: ipywidgets>=8.1.0
Requires-Dist: joblib>=1.2.0
Requires-Dist: networkx<4,>=3.0
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas>=2.0.0
Requires-Dist: psutil>=5.9.5
Requires-Dist: pyarrow>=16.0.0
Requires-Dist: pycryptodomex<4,>=3.10.0
Requires-Dist: pydantic<3,>=2.4.2
Requires-Dist: requests>=2.31.0
Requires-Dist: rich>=13.7.0
Requires-Dist: schema>=0.7.5
Requires-Dist: semantic-version>=2.10.0
Requires-Dist: smart-open>=6.0.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: sshtunnel<0.5,>=0.4.0
Requires-Dist: typer>=0.9.0
Requires-Dist: xlsxwriter<4,>=3.1.9
Requires-Dist: xxhash>=3.2.0
Provides-Extra: databricks
Requires-Dist: databricks-sql-connector<4,>=3.2.0; extra == 'databricks'
Provides-Extra: googlebigquery
Requires-Dist: sqlalchemy-bigquery<2,>=1.6.1; extra == 'googlebigquery'
Provides-Extra: hive
Requires-Dist: impyla<0.20,>=0.19.0; extra == 'hive'
Requires-Dist: kerberos<2,>=1.3.1; extra == 'hive'
Requires-Dist: pyhive[hive-pure-sasl]<0.8,>=0.7.0; extra == 'hive'
Provides-Extra: local
Requires-Dist: adlfs>=2023.4.0; extra == 'local'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local'
Requires-Dist: mostlyai-engine==1.1.4; extra == 'local'
Requires-Dist: mostlyai-qa==1.5.5; extra == 'local'
Requires-Dist: python-multipart>=0.0.20; extra == 'local'
Requires-Dist: s3fs>=2023.1.0; extra == 'local'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local'
Requires-Dist: torch<2.6.0,>=2.5.1; extra == 'local'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local'
Provides-Extra: local-cpu
Requires-Dist: adlfs>=2023.4.0; extra == 'local-cpu'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local-cpu'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local-cpu'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local-cpu'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local-cpu'
Requires-Dist: mostlyai-engine[cpu]==1.1.4; extra == 'local-cpu'
Requires-Dist: mostlyai-qa[cpu]==1.5.5; extra == 'local-cpu'
Requires-Dist: python-multipart>=0.0.20; extra == 'local-cpu'
Requires-Dist: s3fs>=2023.1.0; extra == 'local-cpu'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local-cpu'
Requires-Dist: torch<2.6.0,>=2.5.1; (sys_platform != 'linux') and extra == 'local-cpu'
Requires-Dist: torch==2.5.1+cpu; (sys_platform == 'linux') and extra == 'local-cpu'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local-cpu'
Provides-Extra: local-gpu
Requires-Dist: adlfs>=2023.4.0; extra == 'local-gpu'
Requires-Dist: azure-storage-blob>=12.16.0; extra == 'local-gpu'
Requires-Dist: cloudpathlib[azure,gs,s3]>=0.17.0; extra == 'local-gpu'
Requires-Dist: fastapi<0.116,>=0.115.6; extra == 'local-gpu'
Requires-Dist: gcsfs>=2023.1.0; extra == 'local-gpu'
Requires-Dist: mostlyai-engine[gpu]==1.1.4; extra == 'local-gpu'
Requires-Dist: mostlyai-qa[gpu]==1.5.5; extra == 'local-gpu'
Requires-Dist: python-multipart>=0.0.20; extra == 'local-gpu'
Requires-Dist: s3fs>=2023.1.0; extra == 'local-gpu'
Requires-Dist: smart-open[azure,gcs,s3]>=6.3.0; extra == 'local-gpu'
Requires-Dist: torch<2.6.0,>=2.5.1; extra == 'local-gpu'
Requires-Dist: uvicorn<0.35,>=0.34.0; extra == 'local-gpu'
Provides-Extra: mssql
Requires-Dist: pyodbc<6,>=5.1.0; extra == 'mssql'
Provides-Extra: mysql
Requires-Dist: mysql-connector-python<10,>=9.1.0; extra == 'mysql'
Provides-Extra: oracle
Requires-Dist: oracledb<3,>=2.2.1; extra == 'oracle'
Provides-Extra: postgres
Requires-Dist: psycopg2<3,>=2.9.4; extra == 'postgres'
Provides-Extra: snowflake
Requires-Dist: snowflake-sqlalchemy<2,>=1.6.1; extra == 'snowflake'
Description-Content-Type: text/markdown


# Synthetic Data SDK ✨

[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai/) 
[![PyPI Downloads](https://static.pepy.tech/badge/mostlyai)](https://pepy.tech/projects/mostlyai) 
[![License](https://img.shields.io/github/license/mostly-ai/mostlyai)](https://github.com/mostly-ai/mostlyai/blob/main/LICENSE) 
[![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai)](https://github.com/mostly-ai/mostlyai/releases) 
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mostlyai)](https://pypi.org/project/mostlyai/) 
[![GitHub stars](https://img.shields.io/github/stars/mostly-ai/mostlyai?style=social)](https://github.com/mostly-ai/mostlyai/stargazers)

[SDK Documentation](https://mostly-ai.github.io/mostlyai/) | [Platform Documentation](https://mostly.ai/docs) | [Usage Examples](https://mostly-ai.github.io/mostlyai/usage/)

The Synthetic Data SDK is a Python toolkit for high-fidelity, privacy-safe **Synthetic Data**.

- **Local mode** trains and generates synthetic data locally on your own compute resources.
- **Client mode** connects to a remote MOSTLY AI platform for training & generating synthetic data there.
- Generators that were trained locally can easily be imported into a platform for further sharing.

## Overview

The SDK allows you to programmatically create, browse, and manage three key resources:

1. **Generators** - Train a synthetic data generator on your existing tabular or language data assets
2. **Synthetic Datasets** - Use a generator to create any number of synthetic samples to fit your needs
3. **Connectors** - Connect to any data source within your organization for reading and writing data

| Intent                                        | Primitive                         | Documentation                                                                                                     |
|-----------------------------------------------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------|
| Train a Generator on tabular or language data | `g = mostly.train(config)`        | see [mostly.train](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.train)       |
| Generate any number of synthetic data records | `sd = mostly.generate(g, config)` | see [mostly.generate](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.generate) |
| Live probe the generator on demand            | `df = mostly.probe(g, config)`    | see [mostly.probe](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.probe)       |
| Connect to any data source within your org    | `c = mostly.connect(config)`      | see [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect)   |
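Connectors are configured with a plain dictionary. The sketch below outlines what a source connector for PostgreSQL might look like; the field names (`type`, `access_type`, `config`, `secrets`) are assumptions based on the SDK documentation, and all credentials are placeholders.

```python
# Hypothetical connector config for reading from a PostgreSQL database.
# Field names are assumed from the SDK docs; credentials are placeholders.
connector_config = {
    "name": "My Postgres DB",
    "type": "POSTGRES",
    "access_type": "SOURCE",  # use "DESTINATION" to write synthetic data back
    "config": {
        "host": "db.example.com",
        "port": "5432",
        "username": "analyst",
        "database": "prod",
    },
    "secrets": {"password": "***"},
}

# c = mostly.connect(config=connector_config)  # requires an initialized SDK client
```

See [mostly.connect](https://mostly-ai.github.io/mostlyai/api_client/#mostlyai.sdk.client.api.MostlyAI.connect) for the full list of supported connector types and their configuration options.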

## Installation

**Client mode only**

```shell
pip install -U mostlyai
```

**Client + Local mode**

```shell
# for CPU on macOS
pip install -U 'mostlyai[local]'
# for CPU on Linux
# pip install -U 'mostlyai[local-cpu]' --extra-index-url https://download.pytorch.org/whl/cpu
# for GPU on Linux
# pip install -U 'mostlyai[local-gpu]'
```

**Optional Connectors**

Add any of the following extras for further data connectors support: `databricks`, `googlebigquery`, `hive`, `mssql`, `mysql`, `oracle`, `postgres`, `snowflake`.

For example:
```shell
pip install -U 'mostlyai[local, databricks, snowflake]'
```

## Quick Start <a href="https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/getting-started/getting-started.ipynb" target="_blank"><img src="https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab" alt="Run on Colab"></a>

Train a generator on your own data and generate your first synthetic samples with a few lines of code. For local mode, initialize the SDK with `local=True`. For client mode, initialize the SDK with `base_url` and `api_key` obtained from your [account settings page](https://app.mostly.ai/settings/api-keys).

```python
import pandas as pd
from mostlyai.sdk import MostlyAI

# load original data
repo_url = "https://github.com/mostly-ai/public-demo-data/raw/refs/heads/dev"
df_original = pd.read_csv(f"{repo_url}/census/census.csv.gz")

# initialize the SDK in local or client mode
mostly = MostlyAI(local=True)                       # local mode
# mostly = MostlyAI(base_url='xxx', api_key='xxx')  # client mode

# train a synthetic data generator
g = mostly.train(
    config={
        "name": "US Census Income",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {  # tabular model configuration (optional)
                    "max_training_time": 1,  # - limit training time (in minutes)
                    # model, max_epochs, ...     # further model configurations (optional)
                    # 'differential_privacy': {    # differential privacy configuration (optional)
                    #     'max_epsilon': 5.0,      # - max epsilon value, used as stopping criterion
                    #     'delta': 1e-5,           # - delta value
                    # }
                },
                # columns, keys, compute, ...  # further table configurations (optional)
            }
        ],
    },
    start=True,  # start training immediately (default: True)
    wait=True,  # wait for completion (default: True)
)

# display the quality assurance report
g.reports(display=True)
```

Once the generator has been trained, you can use it to generate synthetic data samples. Either via probing:

```python
# probe for some representative synthetic samples
df_samples = mostly.probe(g, size=100)
df_samples
```

or by creating a synthetic dataset entity for larger data volumes:

```python
# generate a large representative synthetic dataset
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic
```

or by conditionally probing / generating synthetic data:

```python
# create 100 seed records of 24-year-olds from Mexico
df_seed = pd.DataFrame({
    'age': [24] * 100,
    'native_country': ['Mexico'] * 100,
})
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
```

## Key Features

- **Broad Data Support**
    - Mixed-type data (categorical, numerical, geospatial, text, etc.)
    - Single-table, multi-table, and time-series
- **Multiple Model Types**
    - TabularARGN for SOTA tabular performance
    - Fine-tune HuggingFace-based language models
    - Efficient LSTM for text synthesis from scratch
- **Advanced Training Options**
    - GPU/CPU support
    - Differential Privacy
    - Progress Monitoring
- **Automated Quality Assurance**
    - Quality metrics for fidelity and privacy
    - In-depth HTML reports for visual analysis
- **Flexible Sampling**
    - Up-sample to any data volume
    - Conditional generation by any columns
    - Re-balance underrepresented segments
    - Context-aware data imputation
    - Statistical fairness controls
    - Rule-adherence via temperature
- **Seamless Integration**
    - Connect to external data sources (DBs, cloud storage)
    - Fully permissive open-source license
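Several of the flexible-sampling features above are driven by the generation configuration. The sketch below shows how such a config might look; the option names (`sampling_temperature`, `rebalancing`, `imputation`) are assumptions based on the SDK documentation and should be verified against the API reference.

```python
# Hypothetical generation config exercising the flexible-sampling options.
# Option names are assumed from the SDK docs; verify against the API reference.
generation_config = {
    "tables": [
        {
            "name": "census",
            "configuration": {
                "sampling_temperature": 0.9,  # lower values favor rule adherence
                "rebalancing": {              # boost an underrepresented segment
                    "column": "income",
                    "probabilities": {">50K": 0.5},
                },
                "imputation": {"columns": ["age"]},  # context-aware imputation
            },
        }
    ],
}

# sd = mostly.generate(g, config=generation_config)  # requires a trained generator `g`
```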

## Citation

Please consider citing our project if you find it useful:

```bibtex
@software{mostlyai,
    author = {{MOSTLY AI}},
    title = {{MOSTLY AI SDK}},
    url = {https://github.com/mostly-ai/mostlyai},
    year = {2025}
}
```
