Metadata-Version: 2.4
Name: docs2synth
Version: 1.0.3
Summary: A Python package for synthesizing and working with document data.
Author: AI4WA
License-Expression: MIT
Project-URL: Homepage, https://github.com/AI4WA/Docs2Synth
Project-URL: Bug Tracker, https://github.com/AI4WA/Docs2Synth/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Operating System :: OS Independent
Requires-Python: <3.13,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0
Requires-Dist: pyyaml
Requires-Dist: pandas>=1.5.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: requests>=2.28.0
Requires-Dist: typing_extensions>=4.5
Provides-Extra: cpu
Requires-Dist: torch>=2.0.0; extra == "cpu"
Requires-Dist: torchvision>=0.15.0; extra == "cpu"
Requires-Dist: torchaudio>=2.0.0; extra == "cpu"
Requires-Dist: transformers>=4.30.0; extra == "cpu"
Requires-Dist: sentence-transformers>=0.6; extra == "cpu"
Requires-Dist: faiss-cpu>=1.8.0; extra == "cpu"
Requires-Dist: numpy<2.0,>=1.22; extra == "cpu"
Requires-Dist: paddlepaddle<=3.2.1,>=3.2.0; extra == "cpu"
Requires-Dist: paddleocr<3.4.0,>=3.2.0; extra == "cpu"
Requires-Dist: shapely>=1.8; extra == "cpu"
Requires-Dist: scikit-image>=0.21; extra == "cpu"
Requires-Dist: easyocr>=1.7.0; extra == "cpu"
Requires-Dist: pdfplumber>=0.10.0; extra == "cpu"
Requires-Dist: pymupdf>=1.23.0; extra == "cpu"
Requires-Dist: docling>=2.61.1; extra == "cpu"
Requires-Dist: openai>=1.0; extra == "cpu"
Requires-Dist: anthropic>=0.7.0; extra == "cpu"
Requires-Dist: google-generativeai>=0.3.0; extra == "cpu"
Requires-Dist: streamlit>=1.51.0; extra == "cpu"
Requires-Dist: plotly>=6.4.0; extra == "cpu"
Requires-Dist: matplotlib>=3.10.7; extra == "cpu"
Requires-Dist: pyarrow>=10.0.0; extra == "cpu"
Requires-Dist: mcp>=1.20.0; extra == "cpu"
Requires-Dist: starlette>=0.27.0; extra == "cpu"
Requires-Dist: uvicorn>=0.27.0; extra == "cpu"
Requires-Dist: pyjwt[crypto]>=2.8.0; extra == "cpu"
Requires-Dist: httpx>=0.27.0; extra == "cpu"
Requires-Dist: sse-starlette>=2.0.0; extra == "cpu"
Provides-Extra: gpu
Requires-Dist: vllm>=0.11.0; extra == "gpu"
Requires-Dist: faiss-cpu>=1.8.0; extra == "gpu"
Requires-Dist: transformers>=4.30.0; extra == "gpu"
Requires-Dist: sentence-transformers>=0.6; extra == "gpu"
Requires-Dist: numpy<2.0,>=1.22; extra == "gpu"
Requires-Dist: paddlepaddle<=3.2.1,>=3.2.0; extra == "gpu"
Requires-Dist: paddleocr<3.4.0,>=3.2.0; extra == "gpu"
Requires-Dist: shapely>=1.8; extra == "gpu"
Requires-Dist: scikit-image>=0.21; extra == "gpu"
Requires-Dist: easyocr>=1.7.0; extra == "gpu"
Requires-Dist: pdfplumber>=0.10.0; extra == "gpu"
Requires-Dist: pymupdf>=1.23.0; extra == "gpu"
Requires-Dist: docling>=2.61.1; extra == "gpu"
Requires-Dist: openai>=1.0; extra == "gpu"
Requires-Dist: anthropic>=0.7.0; extra == "gpu"
Requires-Dist: google-generativeai>=0.3.0; extra == "gpu"
Requires-Dist: streamlit>=1.51.0; extra == "gpu"
Requires-Dist: plotly>=6.4.0; extra == "gpu"
Requires-Dist: matplotlib>=3.10.7; extra == "gpu"
Requires-Dist: pyarrow>=10.0.0; extra == "gpu"
Requires-Dist: mcp>=1.20.0; extra == "gpu"
Requires-Dist: starlette>=0.27.0; extra == "gpu"
Requires-Dist: uvicorn>=0.27.0; extra == "gpu"
Requires-Dist: pyjwt[crypto]>=2.8.0; extra == "gpu"
Requires-Dist: httpx>=0.27.0; extra == "gpu"
Requires-Dist: sse-starlette>=2.0.0; extra == "gpu"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black==24.8.0; extra == "dev"
Requires-Dist: isort>=5.12; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Requires-Dist: mkdocs>=1.5.0; extra == "dev"
Requires-Dist: mkdocs-material>=9.0.0; extra == "dev"
Requires-Dist: ipykernel>=7.0.0; extra == "dev"
Requires-Dist: nbconvert>=7.0.0; extra == "dev"
Requires-Dist: jupyter; extra == "dev"
Dynamic: license-file

# Docs2Synth

[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue.svg)](https://ai4wa.github.io/Docs2Synth/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.11–3.12](https://img.shields.io/badge/python-3.11--3.12-blue.svg)](https://www.python.org/downloads/)

**Docs2Synth** converts documents, synthesizes QA data, and trains retrievers for document datasets.

## Workflow

```
Documents → Preprocess → QA Generation → Verification →
Human Annotation → Retriever Training → RAG Deployment
```
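Conceptually, the automated stages behave like a function pipeline, where each stage consumes the previous stage's output. The sketch below is illustrative only: the stage functions are hypothetical placeholders, not the package's actual API (the real stages are invoked via the `docs2synth` CLI commands shown later).

```python
from functools import reduce

# Hypothetical placeholder stages; the real pipeline is driven by the
# `docs2synth` CLI, not by these functions.
def preprocess(docs):
    # Pretend-extract text from each raw document.
    return [f"text({d})" for d in docs]

def generate_qa(texts):
    # Pretend-generate a QA set per extracted text.
    return [(t, f"qa-pairs-for:{t}") for t in texts]

def verify(qa_items):
    # Keep only items whose QA set is non-empty.
    return [item for item in qa_items if item[1]]

def pipeline(docs, stages):
    # Thread the data through each stage in order.
    return reduce(lambda data, stage: stage(data), stages, docs)

result = pipeline(["a.pdf", "b.pdf"], [preprocess, generate_qa, verify])
```

The same composition idea underlies both the one-shot `docs2synth run` command and the step-by-step CLI workflow below.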

### 🚀 Quick Start: Automated Pipeline

Run the complete end-to-end pipeline with a single command:

```bash
docs2synth run
```

This automatically chains: preprocessing → QA generation → verification → retriever training → validation → RAG deployment, skipping the manual annotation UI.

### Manual Step-by-Step Workflow

For more control, run each step individually:

```bash
# 1. Preprocess documents
docs2synth preprocess data/raw/my_documents/

# 2. Generate QA pairs
docs2synth qa batch

# 3. Verify quality
docs2synth verify batch

# 4. Annotate (opens UI)
docs2synth annotate

# 5. Train retriever
docs2synth retriever preprocess
docs2synth retriever train --mode standard --lr 1e-5 --epochs 10

# 6. Deploy RAG
docs2synth rag ingest
docs2synth rag app
```

[Complete Workflow Guide →](https://ai4wa.github.io/Docs2Synth/workflow/complete-workflow/)

---

## Installation

### PyPI Installation (Recommended)

**CPU Version (includes all features + MCP server):**
```bash
pip install "docs2synth[cpu]"   # quotes keep zsh from expanding the brackets
```

**GPU Version (includes all features + MCP server):**
```bash
# First install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Then install Docs2Synth with GPU extras
pip install "docs2synth[gpu]"   # quotes keep zsh from expanding the brackets
```

**Minimal Install (CLI only, no ML/MCP features):**
```bash
pip install docs2synth
```

### Development Setup

**Use the setup script (installs uv + dependencies automatically):**

```bash
# Clone
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth

# Run setup script
./setup.sh         # Unix/macOS/WSL
# setup.bat        # Windows
```

The script:
- Installs [uv](https://github.com/astral-sh/uv) (fast package manager)
- Creates virtual environment
- Installs dependencies (CPU or GPU)
- Sets up config

**Manual development setup:**

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh  # Unix/macOS
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows

# Clone and setup
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
uv venv
source .venv/bin/activate  # .venv\Scripts\activate on Windows

# Install for development
uv pip install -e ".[cpu,dev]"  # or [gpu,dev] for GPU

# Setup config
cp config.example.yml config.yml
# Edit config.yml and add your API keys
```

---

## Features

- **Document Processing**: Extract text/layout with Docling, PaddleOCR, PDFPlumber
- **QA Generation**: Automatic question-answer pair generation with LLMs
- **Verification**: Built-in meaningful and correctness verifiers
- **Human Annotation**: Streamlit UI for manual review
- **Retriever Training**: Train LayoutLMv3-based retrievers
- **RAG Deployment**: Deploy with naive or iterative strategies
- **MCP Integration**: Expose as Model Context Protocol server

---

## Configuration

Create `config.yml` from `config.example.yml`:

```yaml
# API keys (config.yml is in .gitignore)
agent:
  keys:
    openai_api_key: "sk-..."
    anthropic_api_key: "sk-ant-..."

# Document processing
preprocess:
  processor: docling
  input_dir: ./data/raw/
  output_dir: ./data/processed/

# QA generation
qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini

# Retriever training
retriever:
  learning_rate: 1e-5
  epochs: 10

# RAG
rag:
  embedding:
    model: sentence-transformers/all-MiniLM-L6-v2
```
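The same YAML can be read programmatically with PyYAML, which is already a core dependency. A minimal sketch, parsing a fragment of the example inline rather than from disk:

```python
import yaml

# A fragment of the example config.yml shown above.
config_text = """
preprocess:
  processor: docling
  input_dir: ./data/raw/
  output_dir: ./data/processed/
retriever:
  learning_rate: 1e-5
  epochs: 10
"""

config = yaml.safe_load(config_text)
processor = config["preprocess"]["processor"]   # "docling"
epochs = config["retriever"]["epochs"]          # 10 (int)
# Note: PyYAML parses `1e-5` (no decimal point) as the string "1e-5";
# write `1.0e-5` if you need it loaded as a float.
```

Always use `yaml.safe_load` rather than `yaml.load` for config files, since the latter can construct arbitrary Python objects.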

---

## Docker

```bash
# CPU
./scripts/build-docker.sh cpu

# GPU
./scripts/build-docker.sh gpu
```

See [Docker Builds](https://ai4wa.github.io/Docs2Synth/development/docker-builds/)

---

## Documentation

Full documentation: **https://ai4wa.github.io/Docs2Synth/**

- [Complete Workflow Guide](https://ai4wa.github.io/Docs2Synth/workflow/complete-workflow/)
- [CLI Reference](https://ai4wa.github.io/Docs2Synth/cli-reference/)
- [Document Processing](https://ai4wa.github.io/Docs2Synth/workflow/document-processing/)
- [QA Generation](https://ai4wa.github.io/Docs2Synth/workflow/qa-generation/)
- [Retriever Training](https://ai4wa.github.io/Docs2Synth/workflow/retriever-training/)
- [RAG Deployment](https://ai4wa.github.io/Docs2Synth/workflow/rag-path/)

---

## Contributing

We welcome contributions! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Run tests: `pytest tests/ -v`
5. Run code quality checks: `./scripts/check.sh`
6. Submit a pull request

See [Dependency Management](https://ai4wa.github.io/Docs2Synth/development/dependency-management/) for dev setup details.

---

## License

MIT License - see [LICENSE](LICENSE) file for details.

---

## Citation

If you use Docs2Synth in your research, please cite:

```bibtex
@software{docs2synth2024,
  title = {Docs2Synth: Document Processing and Retriever Training},
  author = {AI4WA Team},
  year = {2024},
  url = {https://github.com/AI4WA/Docs2Synth}
}
```

---

## Support

- **Documentation**: https://ai4wa.github.io/Docs2Synth/
- **Issues**: https://github.com/AI4WA/Docs2Synth/issues
- **Discussions**: https://github.com/AI4WA/Docs2Synth/discussions
