Metadata-Version: 2.4
Name: mosaicx
Version: 1.0.6
Summary: Medical cOmputational Suite for Advanced Intelligent eXtraction
Project-URL: Homepage, https://github.com/LalithShiyam/MOSAICX
Project-URL: Repository, https://github.com/LalithShiyam/MOSAICX
Project-URL: Documentation, https://github.com/LalithShiyam/MOSAICX#readme
Project-URL: Bug Tracker, https://github.com/LalithShiyam/MOSAICX/issues
Author-email: Lalith Kumar Shiyam Sundar <lalith.shiyam@med.uni-muenchen.de>
License: DUAL LICENSING NOTICE
        ====================
        
        MOSAICX is dual-licensed under the terms of both the GNU Affero General Public License v3.0 (AGPL-3.0) and a Commercial License.
        
        OPEN SOURCE LICENSE
        ===================
        
        This software is available under the GNU Affero General Public License v3.0 (AGPL-3.0).
        
        Under this license, you are free to use, modify, and distribute this software, provided that:
        - Any derivative work or application that uses this software must also be open-sourced under AGPL-3.0
        - If you run this software on a server and provide it as a service, you must make the complete source code of your application (including modifications) available to your users
        - You must include this license notice and copyright information in all copies
        
        For the complete AGPL-3.0 license terms, see LICENSE-AGPL-3.0.txt
        
        COMMERCIAL LICENSE
        ==================
        
        If you wish to use this software in a commercial product or service without the open-source requirements of AGPL-3.0, you must obtain a commercial license.
        
        Commercial licenses are available from:
        
            Zenta GmbH
            
            For commercial licensing inquiries, please contact:
            Email: info@zenta.solutions
            Subject: MOSAICX Commercial License Request
        
        Commercial licensing allows you to:
        - Use this software in proprietary applications
        - Distribute applications containing this software without open-source obligations
        - Customize and modify the software without sharing changes
        - Receive commercial support and maintenance
        
        COPYRIGHT AND ATTRIBUTION
        ==========================
        
        Copyright (c) 2024 DIGITX Lab, Department of Radiology, LMU Munich University Hospital
        Developed by Lalith Kumar Shiyam Sundar, PhD
        
        Commercial licensing managed by Zenta GmbH
        
        IMPORTANT NOTICE
        ================
        
        By using this software, you agree to comply with the terms of one of the above licenses.
        If you are unsure which license applies to your use case, please contact Zenta GmbH for clarification.
License-File: LICENSE
Keywords: extraction,llm,medical,nlp,pdf,radiology
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: click>=8.1.0
Requires-Dist: docling>=2.0.0
Requires-Dist: dspy-ai>=2.4.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: instructor>=1.0.0
Requires-Dist: ollama>=0.3.0
Requires-Dist: openai>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-cfonts>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: reportlab>=4.4.4
Requires-Dist: rich-click>=1.8.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typing-extensions>=4.8.0
Provides-Extra: dev
Requires-Dist: black>=23.7.0; extra == 'dev'
Requires-Dist: isort>=5.12.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pre-commit>=3.3.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.0.280; extra == 'dev'
Description-Content-Type: text/markdown


<div align="center">
  <img src="assets/mosaicx_logo.png" alt="MOSAICX Logo" width="800"/>
</div>

<p align="center">
  <a href="https://pypi.org/project/mosaicx/"><img alt="PyPI" src="https://img.shields.io/pypi/v/mosaicx.svg?label=PyPI&style=for-the-badge&logo=python&logoColor=white&color=bd93f9"></a>
  <a href="https://www.python.org/downloads/"><img alt="Python" src="https://img.shields.io/badge/Python-3.11%2B-50fa7b?style=for-the-badge&logo=python&logoColor=white"></a>
  <a href="https://www.gnu.org/licenses/agpl-3.0"><img alt="License" src="https://img.shields.io/badge/License-AGPL--3.0-ff79c6?style=for-the-badge&logo=gnu&logoColor=white"></a>
  <a href="https://pepy.tech/project/mosaicx"><img alt="Downloads" src="https://img.shields.io/pepy/dt/mosaicx?style=for-the-badge&color=8be9fd&label=Downloads"></a>
  <a href="https://pydantic.dev"><img alt="Pydantic v2" src="https://img.shields.io/badge/Pydantic-v2-ffb86c?style=for-the-badge&logo=pydantic&logoColor=white"></a>
  <a href="https://ollama.ai"><img alt="Ollama Compatible" src="https://img.shields.io/badge/Ollama-Compatible-6272a4?style=for-the-badge&logo=ghost&logoColor=white"></a>
</p>

## The Foundation of Healthcare Data Transformation

MOSAICX represents the first intentional step toward building a living, breathing ecosystem for healthcare data intelligence. At DIGIT-X Lab, we believe that meaningful clinical insights begin with one fundamental challenge: **structuring the unstructured**.

Healthcare data is notoriously fragmented, locked in silos, and resistant to systematic analysis. MOSAICX addresses this by transforming unstructured medical documents into validated, interoperable data structures that become the foundation for connected healthcare intelligence.

**From this:**
```
"Pat.-Nr.: 111111111, geb. 13.03.1940, Müller, Jane
Transthorakale Echokardiographie vom 06.10.2020 10:45
Befund: Mitralklappe physiologische Insuffizienz..."
```

**To this:**
```json
{
  "patient_id": "111111111",
  "age": 80,
  "sex": "Female", 
  "mitral_valve_grade": "Normal",
  "tricuspid_valve_grade": "Mild"
}
```

This is not just data extraction. This is infrastructure—the systematic foundation that enables healthcare data to move, connect, and speak in a common language that both humans and AI can understand.

## 🚀 **Quick Start

### Prerequisites

**⚠️ IMPORTANT: You need Ollama running locally before using MOSAICX**

MOSAICX uses local LLMs through Ollama for privacy and control. Install Ollama first:

**🖥️ Install Ollama:**
```bash
# macOS/Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or visit https://ollama.com/download for Windows/GUI installers
```

**🤖 Install a compatible model:**
```bash
# Recommended: Fast and reliable for medical text
ollama pull mistral

# Alternative: Larger model for complex cases  
ollama pull llama3.1

# Verify Ollama is running
ollama list
```

### Installation

**Option 1: Standard Installation**
```bash
pip install mosaicx
```

**Option 2: With UV (Faster & Better)**
```bash
uv add mosaicx
```

### Verification
```bash
# Check if everything works
mosaicx --help

# Test with your installed model
mosaicx generate --desc "Patient demographics" --model mistral
```

### Complete Workflow Guide

**🔄 The MOSAICX Process: From Description to Data**

```bash
# Step 1: Generate your schema from natural language
mosaicx generate --desc "Echocardiography report with patient ID, age, sex, and all cardiac valve conditions including severity grades" --model mistral

# Step 2: See what schemas you have
mosaicx schemas

# Step 3: Extract data from your PDF
mosaicx extract --pdf echo_report.pdf --schema EchocardiographyReport --model mistral --save results.json

# Step 4: Repeat for more files
mosaicx extract --pdf another_report.pdf --schema EchocardiographyReport --model mistral
```

**💡 Pro Tips for Better Results:**

### 📝 **Writing Effective Descriptions**

**✅ DO:**
```bash
# Be specific about data types and medical domains
mosaicx generate --desc "Patient demographics (ID, age, sex, birth_date) plus cardiac valve assessment from echocardiography with severity grades (Normal, Mild, Moderate, Severe) for mitral, aortic, tricuspid, and pulmonary valves"

# Include expected value formats  
mosaicx generate --desc "Lab results with numeric values: hemoglobin (g/dL), white blood cell count (cells/µL), platelet count (K/µL), and categorical blood type (A, B, AB, O with +/- Rh factor)"

# Mention the document source type
mosaicx generate --desc "Radiology report findings from MRI brain scans including lesion locations, sizes in mm, enhancement patterns, and radiologist impression"
```

**❌ DON'T:**
```bash
# Too vague - will generate generic fields
mosaicx generate --desc "patient data"

# Missing context - unclear what values to expect  
mosaicx generate --desc "heart information from reports"

# No data types - will guess incorrectly
mosaicx generate --desc "some medical stuff from PDFs"
```

### 🎯 **Schema Management Workflow**

```bash
# List all your generated schemas
mosaicx schemas

# Extract using different schema formats:
mosaicx extract --pdf report.pdf --schema EchoReport_20250919_143022        # Schema ID
mosaicx extract --pdf report.pdf --schema echoreport_patient_cardiac.py     # Filename  
mosaicx extract --pdf report.pdf --schema /path/to/schema/echoreport.py     # Full path
```

### 🔍 **Real Example: Complete Workflow**

Let's say you have echocardiography reports and want to extract structured data:

```bash
# 1. Generate schema with detailed description
mosaicx generate \
  --desc "Echocardiography report containing patient_id (string), age (integer), sex (Male/Female), exam_date (YYYY-MM-DD format), mitral_valve_condition (Normal/Mild/Moderate/Severe), aortic_valve_condition (Normal/Mild/Moderate/Severe), ejection_fraction (percentage as integer), and clinical_impression (free text)" \
  --model mistral

# Output: Created schema 'EchocardiographyReport_20250919_143022'

# 2. Verify your schema was created
mosaicx schemas
# Shows: EchocardiographyReport_20250919_143022 | Echocardiography report containing patient demographics and valve conditions

# 3. Extract from your first report
mosaicx extract \
  --pdf patient_001_echo.pdf \
  --schema EchocardiographyReport_20250919_143022 \
  --model mistral \
  --save patient_001_results.json

# 4. Extract from multiple reports (same schema)
mosaicx extract --pdf patient_002_echo.pdf --schema EchocardiographyReport_20250919_143022 --model mistral
mosaicx extract --pdf patient_003_echo.pdf --schema EchocardiographyReport_20250919_143022 --model mistral

# 5. Results saved as structured JSON ready for analysis!
```

**🎯 Pro Tip:** If you see "connection refused" errors, make sure Ollama is running: `ollama serve` (usually runs automatically after installation)

---

## Building the Foundation for Connected Healthcare Intelligence

### **The Challenge We Address**

Healthcare generates vast amounts of data, yet most of it remains trapped in unstructured formats that resist systematic analysis. At DIGIT-X Lab, we recognized that before we can build intelligent systems that truly serve clinical care, we must solve the fundamental problem of **data structure**.

This is where meaningful healthcare AI begins—not with complex models, but with the infrastructure that allows medical data to be:
- **Structured** systematically and consistently
- **Connected** across clinical silos and departments  
- **Validated** against clinical standards and requirements
- **Transformed** into knowledge that supports decision-making

### **Our Vision: A Living Ecosystem**

MOSAICX is the first building block in our vision of a living, breathing ecosystem for healthcare data intelligence. We believe that when medical data is properly structured, insights should emerge as "free lunches"—natural byproducts of well-designed systems rather than labor-intensive manual processes.

**The ecosystem we're building:**

- �️ **Infrastructure-First**: Reliable, reproducible data transformation that scales
- 🔗 **Interconnected**: Systems that speak to each other in meaningful ways
- 🩺 **Clinically Grounded**: Tools built from real clinical needs, not theoretical possibilities
- � **Transparent**: AI that can explain its reasoning in natural language
- 🌱 **Evolving**: Adaptive systems that learn and improve with use

### **Why Structure Comes First**

In radiology—and healthcare broadly—the most sophisticated AI model is only as good as the data it can understand. Unstructured medical reports, imaging findings, and clinical notes represent decades of medical knowledge locked away from systematic analysis.

By solving the structuring problem first, we create the foundation for:
- **Longitudinal patient analysis** across multiple reports and timepoints
- **Cross-modal integration** linking imaging, lab results, and clinical notes
- **Population-level insights** drawn from properly standardized clinical data
- **Reproducible research** built on validated, structured datasets

MOSAICX transforms the labor-intensive process of manual data extraction into systematic, validated infrastructure that serves as the backbone for more sophisticated healthcare intelligence systems.

---

## Infrastructure for Medical Data Transformation

### **Systematic Data Pipeline**
```
📝 Clinical Description → 🤖 Local LLM → 🏗️ Validated Schema → 📄 Medical Document → 🔍 Structured Extraction → ✨ Interoperable Data
```

MOSAICX implements a systematic approach to medical data transformation that prioritizes reproducibility, clinical validity, and infrastructure scalability.

### **Schema-Driven Architecture**

**1. Intelligent Schema Generation** 
```bash
mosaicx generate --desc "Echocardiography report with comprehensive cardiac valve assessment" --model mistral
```

Our approach begins with **clinical intent**: you describe your data requirements in natural language, and MOSAICX generates validated Pydantic schemas that encode both structure and clinical semantics.

**Generated Infrastructure:**
```python
class EchocardiographyReport(BaseModel):
    patient_id: str = Field(description="Patient identifier")
    age: int = Field(description="Patient age in years", ge=0, le=150)
    mitral_valve_condition: Literal["Normal", "Mild", "Moderate", "Severe"] = Field(
        description="Mitral valve condition severity grade"
    )
    ejection_fraction: Optional[int] = Field(
        description="Left ventricular ejection fraction percentage",
        ge=15, le=80
    )
```

This is not just code generation—it's **clinical data infrastructure** with built-in validation, semantic clarity, and interoperability standards.

**2. Systematic Data Extraction**
```bash  
mosaicx extract --pdf echo_report.pdf --schema EchocardiographyReport --model mistral --save structured_data.json
```

**The extraction process prioritizes:**
- **Clinical accuracy** through schema-guided parsing
- **Data validation** with automatic constraint checking  
- **Reproducibility** via consistent local processing
- **Transparency** in how clinical concepts map to structured fields

**Structured Output:**
```json
{
  "patient_id": "0022768653",
  "age": 78,
  "mitral_valve_condition": "Mild", 
  "ejection_fraction": 55,
  "extraction_metadata": {
    "schema_version": "EchocardiographyReport_20250919_143022",
    "processing_timestamp": "2025-09-19T14:30:22Z",
    "validation_status": "PASSED"
  }
}
```

**3. Infrastructure for Scale**
```bash
mosaicx schemas  # Manage your data infrastructure assets
```

MOSAICX treats schemas as **infrastructure assets**—versioned, reusable, and designed for systematic deployment across clinical workflows.

**Key Infrastructure Principles:**
- **Local Processing**: All data transformation happens within your clinical environment
- **Schema Versioning**: Systematic tracking of data structure evolution
- **Validation Pipeline**: Built-in quality control for extracted data
- **Interoperability**: Standards-compliant output ready for downstream systems

**🔒 Privacy Note:** All LLM processing happens locally through Ollama. Your patient data never touches the internet.

---

## 🎓 **Advanced Usage & Best Practices**

### **🏆 Expert Tips for Schema Descriptions**

**📋 Template for Medical Reports:**
```bash
mosaicx generate --desc "[Document Type] containing [Demographics] (specify fields and types), [Clinical Data] (with units and value ranges), [Categorical Fields] (list expected values), and [Free Text Fields] (impression, notes, etc.)"
```

**🩺 Real-World Examples by Medical Domain:**

**Radiology Reports:**
```bash
mosaicx generate --desc "Brain MRI report with patient_id (string), scan_date (YYYY-MM-DD), sequence_types (T1, T2, FLAIR, DWI), lesion_count (integer), lesion_locations (frontal, parietal, temporal, occipital), largest_lesion_size_mm (float), enhancement_pattern (none, ring, nodular), and radiologist_impression (free text)"
```

**Laboratory Results:**
```bash  
mosaicx generate --desc "Complete blood count results with patient_id (string), test_date (YYYY-MM-DD), hemoglobin_g_per_dl (float 10-18), hematocrit_percentage (float 30-50), white_blood_cells_per_ul (integer 4000-11000), platelet_count_k_per_ul (integer 150-450), and reference_ranges_exceeded (list of abnormal values)"
```

**Pathology Reports:**
```bash
mosaicx generate --desc "Surgical pathology report including patient_id (string), specimen_type (biopsy, resection, cytology), organ_system (breast, lung, colon, etc.), diagnosis_primary (free text), diagnosis_secondary (free text), tumor_size_cm (float), grade (1-4 or well/moderate/poor), stage_tnm (TNM format), margins_status (positive, negative, close), and pathologist_name (string)"
```

### **⚡ Performance Optimization**

**Model Selection Strategy:**
```bash
# Fast processing for simple extractions
mosaicx extract --pdf simple_report.pdf --schema SimpleSchema --model mistral

# Better accuracy for complex medical terminology  
mosaicx extract --pdf complex_report.pdf --schema ComplexSchema --model llama3.1

# Batch processing with consistent model
for pdf in reports/*.pdf; do
    mosaicx extract --pdf "$pdf" --schema MedicalReport --model mistral --save "results/$(basename "$pdf" .pdf).json"
done
```

### **🔍 Quality Control & Validation**

**Iterative Schema Refinement:**
```bash
# Start with basic schema
mosaicx generate --desc "Basic patient demographics and primary diagnosis"

# Test on a few files, then refine
mosaicx generate --desc "Patient demographics (patient_id as 8-digit number, age 0-120, sex as M/F only) plus primary_diagnosis (ICD-10 format preferred), secondary_diagnoses (list), and admission_date (MM/DD/YYYY format)"
```

**Result Verification:**
```bash
# Save results for manual review
mosaicx extract --pdf report.pdf --schema MySchema --model mistral --save results_to_review.json

# Use debug mode for problematic extractions
mosaicx extract --pdf problematic.pdf --schema MySchema --model mistral --debug
```

---

## Clinical-Grade Features for Healthcare Data Infrastructure

### **🏥 Medical Domain Intelligence**
- **Clinical terminology mapping**: Systematic translation of medical abbreviations and multilingual terms
- **Standardized value sets**: Automatic mapping to clinical standards (ICD-10, SNOMED CT concepts)
- **Contextual understanding**: Distinction between normal findings, pathology, and uncertainty expressions

### **� Privacy-First Architecture**
- **Local processing**: All data transformation occurs within your clinical environment
- **No cloud dependencies**: Patient data never leaves your institutional network
- **GDPR compliance**: Built-in privacy protection by design for European healthcare environments

### **⚕️ Clinical Workflow Integration**
```
📊 Structured Extraction Results → Clinical Dashboard

┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Clinical Parameter       ┃ Extracted Value                 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Patient ID               │ 0022768653                      │
│ Cardiac Function         │ Preserved (EF: 55%)             │
│ Valve Assessment         │ Mild mitral insufficiency       │
│ Clinical Significance    │ Routine follow-up recommended   │
└──────────────────────────┴─────────────────────────────────┘
```

### **� Infrastructure Reliability**
- **Schema versioning**: Systematic tracking of data structure evolution for longitudinal studies
- **Validation pipeline**: Multi-layer quality control ensuring clinical data integrity
- **Error recovery**: Robust handling of edge cases common in real-world medical documents
- **Audit trails**: Complete tracking of data transformation processes for clinical research compliance

## 🔧 **Troubleshooting**

### **Common Issues & Solutions**

**❌ "Connection refused" or "Model not found" errors:**
```bash
# Check if Ollama is running
ollama list

# If not installed, install it:
curl -fsSL https://ollama.com/install.sh | sh

# If installed but not running:
ollama serve

# Check available models:
ollama list
```

**❌ "No module named 'mosaicx'" error:**
```bash
# Make sure you're in the right environment
pip show mosaicx

# If not installed:
pip install mosaicx
```

**❌ PDF extraction fails:**
```bash
# Install additional dependencies if needed
pip install "mosaicx[pdf]"  # (if this exists)

# Check if PDF is readable:
file your_pdf.pdf
```

**❌ Schema generation produces weird results:**
```bash
# Try a different model
mosaicx generate --desc "your description" --model llama3.1

# Use more specific descriptions
mosaicx generate --desc "Patient demographics including age, sex, ID, and cardiac valve conditions from echocardiography reports"
```

**🆘 Still stuck?** Open an issue with:
- Your OS and Python version
- Ollama version (`ollama --version`)
- The exact error message
- A minimal example (anonymized data!)

---

## The Larger Vision

### **DIGIT-X Lab: Building Connected Healthcare Intelligence**

MOSAICX represents the foundational layer of our vision for healthcare data transformation. At DIGIT-X Lab in the Department of Radiology at LMU Klinikum, we're working systematically toward healthcare systems that are:

- **Interconnected**: Where imaging, laboratory, and clinical data speak a common structured language
- **Intelligent**: Where AI operates transparently on validated, structured clinical information  
- **Infrastructure-Focused**: Where data transformation is reliable, reproducible, and clinically grounded
- **Clinically Useful**: Where technology serves real healthcare needs rather than abstract possibilities

**What comes next:**
- **Cross-modal integration**: Linking structured imaging findings with clinical notes and laboratory results
- **Longitudinal patient models**: Systematic tracking of clinical changes over time using structured data
- **Clinical decision infrastructure**: AI systems that can explain their reasoning using structured clinical evidence
- **Interoperable ecosystems**: Healthcare data that moves seamlessly between institutions and systems

### **Research Collaboration**

DIGIT-X Lab actively collaborates with clinical researchers, healthcare informaticists, and AI developers who share our infrastructure-first approach to healthcare intelligence.

**Contact for research collaboration:**
- **Email**: lalith.shiyam@med.uni-muenchen.de  
- **Institution**: DIGIT-X Lab, Department of Radiology, LMU Klinikum, Munich

### **Citation**

If MOSAICX contributes to your research or clinical work, please cite:

```bibtex
@software{mosaicx2025,
  title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author={Shiyam Sundar, Lalith Kumar and DIGIT-X Lab Team},
  year={2025},
  url={https://github.com/LalithShiyam/MOSAICX},
  institution={DIGIT-X Lab, Department of Radiology, LMU Klinikum},
  note={Infrastructure for systematic healthcare data transformation}
}
```

---

## Acknowledgments

**Core Infrastructure:**
- **Ollama**: Local LLM deployment enabling privacy-preserving clinical data processing
- **Docling**: Advanced document parsing specifically designed for complex medical document formats
- **Pydantic**: Data validation ensuring clinical data integrity and interoperability
- **Rich**: Clear, accessible interfaces for clinical workflow integration

**Clinical Collaboration:**
- **LMU Klinikum clinical departments**: Providing real-world validation environments and use cases
- **DIGIT-X Lab team**: Systematic development and validation of healthcare data infrastructure
- **International collaborators**: Multi-site validation and cross-institutional deployment

---

*Building healthcare data infrastructure, one systematic transformation at a time.*

**DIGIT-X Lab**  
Digital Transformation in Radiology  
Department of Radiology  
LMU Klinikum, Munich
